* [TEST] forwarding/router_bridge_lag.sh started to flake on Monday
From: Jakub Kicinski @ 2024-08-22 15:37 UTC
  To: Petr Machata, Nikolay Aleksandrov, Hangbin Liu; +Cc: netdev@vger.kernel.org

Hi!

Looks like forwarding/router_bridge_lag.sh has gotten a lot more flaky
this week. It flaked very occasionally (and in a different way) before:

https://netdev.bots.linux.dev/contest.html?executor=vmksft-forwarding&test=router-bridge-lag-sh&ld_cnt=250

There doesn't seem to be any obvious commit that could have caused this.

Any ideas?


* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday
From: Petr Machata @ 2024-08-23 11:28 UTC
  To: Jakub Kicinski
  Cc: Petr Machata, Nikolay Aleksandrov, Hangbin Liu,
	netdev@vger.kernel.org


Jakub Kicinski <kuba@kernel.org> writes:

> Looks like forwarding/router_bridge_lag.sh has gotten a lot more flaky
> this week. It flaked very occasionally (and in a different way) before:
>
> https://netdev.bots.linux.dev/contest.html?executor=vmksft-forwarding&test=router-bridge-lag-sh&ld_cnt=250
>
> There doesn't seem to be any obvious commit that could have caused this.

Hmm:
    # 3.37 [+0.11] Error: Device is up. Set it down before adding it as a team port.

How are the tests isolated: is each one run in its own vng, or are
instances shared? Could it be that the test that runs before this one
neglects to take a port down?
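
For reference, that error comes from the teaming driver, which refuses
to enslave a port that is administratively up. A minimal reproduction
sketch with dummy devices (not from the failing run itself; assumes the
dummy and team drivers are available):

    ip link add name d1 type dummy
    ip link add name team0 type team
    ip link set dev d1 up
    ip link set dev d1 master team0
    # Error: Device is up. Set it down before adding it as a team port.
    ip link set dev d1 down
    ip link set dev d1 master team0   # succeeds once the port is down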

In one failure case (I can't see further back, or my browser would
apparently catch fire) the predecessor was no_forwarding.sh, and indeed
it looks like it brings the ports up, but I don't see where it sets them
back down.

Then router-bridge-lag's cleanup downs the ports, and on rerun it
succeeds. The issue would be probabilistic, because no_forwarding does
not always run before this test, and some tests do not care that the
ports are up. If that's the root cause, this should fix it:

From 0baf91dc24b95ae0cadfdf5db05b74888e6a228a Mon Sep 17 00:00:00 2001
Message-ID: <0baf91dc24b95ae0cadfdf5db05b74888e6a228a.1724413545.git.petrm@nvidia.com>
From: Petr Machata <petrm@nvidia.com>
Date: Fri, 23 Aug 2024 14:42:48 +0300
Subject: [PATCH net-next mlxsw] selftests: forwarding: no_forwarding: Down
 ports on cleanup
To: <nbu-linux-internal@nvidia.com>

This test neglects to put ports down on cleanup. Fix it.

Fixes: 476a4f05d9b8 ("selftests: forwarding: add a no_forwarding.sh test")
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
 tools/testing/selftests/net/forwarding/no_forwarding.sh | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/no_forwarding.sh b/tools/testing/selftests/net/forwarding/no_forwarding.sh
index af3b398d13f0..9e677aa64a06 100755
--- a/tools/testing/selftests/net/forwarding/no_forwarding.sh
+++ b/tools/testing/selftests/net/forwarding/no_forwarding.sh
@@ -233,6 +233,9 @@ cleanup()
 {
 	pre_cleanup
 
+	ip link set dev $swp2 down
+	ip link set dev $swp1 down
+
 	h2_destroy
 	h1_destroy
 
-- 
2.45.0


* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday
From: Jakub Kicinski @ 2024-08-23 15:02 UTC
  To: Petr Machata; +Cc: Nikolay Aleksandrov, Hangbin Liu, netdev@vger.kernel.org

On Fri, 23 Aug 2024 13:28:11 +0200 Petr Machata wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
> 
> > Looks like forwarding/router_bridge_lag.sh has gotten a lot more flaky
> > this week. It flaked very occasionally (and in a different way) before:
> >
> > https://netdev.bots.linux.dev/contest.html?executor=vmksft-forwarding&test=router-bridge-lag-sh&ld_cnt=250
> >
> > There doesn't seem to be any obvious commit that could have caused this.  
> 
> Hmm:
>     # 3.37 [+0.11] Error: Device is up. Set it down before adding it as a team port.
> 
> How are the tests isolated: is each one run in its own vng, or are
> instances shared? Could it be that the test that runs before this one
> neglects to take a port down?

Yes, each one has its own VM, but the VM is reused for multiple tests
serially. The "info" file shows which VM was used (thr-id identifies
the worker, vm-id identifies the VM within the worker, and the worker
restarts the VM if it detects a crash).

> In one failure case (I can't see further back, or my browser would
> apparently catch fire) the predecessor was no_forwarding.sh, and indeed
> it looks like it brings the ports up, but I don't see where it sets them
> back down.
> 
> Then router-bridge-lag's cleanup downs the ports, and on rerun it
> succeeds. The issue would be probabilistic, because no_forwarding does
> not always run before this test, and some tests do not care that the
> ports are up. If that's the root cause, this should fix it:
> 
> From 0baf91dc24b95ae0cadfdf5db05b74888e6a228a Mon Sep 17 00:00:00 2001
> Message-ID: <0baf91dc24b95ae0cadfdf5db05b74888e6a228a.1724413545.git.petrm@nvidia.com>
> From: Petr Machata <petrm@nvidia.com>
> Date: Fri, 23 Aug 2024 14:42:48 +0300
> Subject: [PATCH net-next mlxsw] selftests: forwarding: no_forwarding: Down
>  ports on cleanup
> To: <nbu-linux-internal@nvidia.com>
> 
> This test neglects to put ports down on cleanup. Fix it.
> 
> Fixes: 476a4f05d9b8 ("selftests: forwarding: add a no_forwarding.sh test")
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> ---
>  tools/testing/selftests/net/forwarding/no_forwarding.sh | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/tools/testing/selftests/net/forwarding/no_forwarding.sh b/tools/testing/selftests/net/forwarding/no_forwarding.sh
> index af3b398d13f0..9e677aa64a06 100755
> --- a/tools/testing/selftests/net/forwarding/no_forwarding.sh
> +++ b/tools/testing/selftests/net/forwarding/no_forwarding.sh
> @@ -233,6 +233,9 @@ cleanup()
>  {
>  	pre_cleanup
>  
> +	ip link set dev $swp2 down
> +	ip link set dev $swp1 down
> +
>  	h2_destroy
>  	h1_destroy
>  

no_forwarding always runs in thread 0 because it's the slowest test,
and we schedule from the slowest first as a basic bin-packing heuristic.
Clicking through the failures, I don't see them on thread 0.
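
Roughly, the assignment works like this (a toy sketch of the
longest-first heuristic, not the actual runner code; the test names and
durations below are made up):

    # Sort tests by expected runtime, longest first, then hand each one
    # to whichever thread currently has the least accumulated work.
    printf '%s\n' \
        'no_forwarding.sh 600' \
        'local_termination.sh 300' \
        'router_bridge_lag.sh 120' |
    sort -k2,2 -rn |
    awk -v NT=2 '{
        best = 0
        for (t = 1; t < NT; t++)
            if (load[t] < load[best])
                best = t
        load[best] += $2
        printf "thread%d: %s\n", best, $1
    }'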

But putting the ports down seems like a good cleanup regardless.


* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday
From: Petr Machata @ 2024-08-23 16:13 UTC
  To: Jakub Kicinski
  Cc: Petr Machata, Nikolay Aleksandrov, Hangbin Liu,
	netdev@vger.kernel.org


Jakub Kicinski <kuba@kernel.org> writes:

> On Fri, 23 Aug 2024 13:28:11 +0200 Petr Machata wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>> 
>> > Looks like forwarding/router_bridge_lag.sh has gotten a lot more flaky
>> > this week. It flaked very occasionally (and in a different way) before:
>> >
>> > https://netdev.bots.linux.dev/contest.html?executor=vmksft-forwarding&test=router-bridge-lag-sh&ld_cnt=250
>> >
>> > There doesn't seem to be any obvious commit that could have caused this.  
>> 
>> Hmm:
>>     # 3.37 [+0.11] Error: Device is up. Set it down before adding it as a team port.
>> 
>> How are the tests isolated: is each one run in its own vng, or are
>> instances shared? Could it be that the test that runs before this one
>> neglects to take a port down?
>
> Yes, each one has its own VM, but the VM is reused for multiple tests
> serially. The "info" file shows which VM was used (thr-id identifies
> the worker, vm-id identifies the VM within the worker, and the worker
> restarts the VM if it detects a crash).

OK, so my guess would be that whatever ran before the test forgot to put
the port down.
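
One way to confirm would be to dump the state of the topology ports
right before the test starts and look for the UP flag (a diagnostic
sketch; $swp1/$swp2 are the port variables the forwarding selftests
use):

    for dev in $swp1 $swp2; do
        ip link show dev $dev   # UP flag => a predecessor left it up
    done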

>> In one failure case (I can't see further back, or my browser would
>> apparently catch fire) the predecessor was no_forwarding.sh, and indeed
>> it looks like it brings the ports up, but I don't see where it sets them
>> back down.
>> 
>> Then router-bridge-lag's cleanup downs the ports, and on rerun it
>> succeeds. The issue would be probabilistic, because no_forwarding does
>> not always run before this test, and some tests do not care that the
>> ports are up. If that's the root cause, this should fix it:
>> 
>> From 0baf91dc24b95ae0cadfdf5db05b74888e6a228a Mon Sep 17 00:00:00 2001
>> Message-ID: <0baf91dc24b95ae0cadfdf5db05b74888e6a228a.1724413545.git.petrm@nvidia.com>
>> From: Petr Machata <petrm@nvidia.com>
>> Date: Fri, 23 Aug 2024 14:42:48 +0300
>> Subject: [PATCH net-next mlxsw] selftests: forwarding: no_forwarding: Down
>>  ports on cleanup
>> To: <nbu-linux-internal@nvidia.com>
>> 
>> This test neglects to put ports down on cleanup. Fix it.
>> 
>> Fixes: 476a4f05d9b8 ("selftests: forwarding: add a no_forwarding.sh test")
>> Signed-off-by: Petr Machata <petrm@nvidia.com>
>> ---
>>  tools/testing/selftests/net/forwarding/no_forwarding.sh | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/tools/testing/selftests/net/forwarding/no_forwarding.sh b/tools/testing/selftests/net/forwarding/no_forwarding.sh
>> index af3b398d13f0..9e677aa64a06 100755
>> --- a/tools/testing/selftests/net/forwarding/no_forwarding.sh
>> +++ b/tools/testing/selftests/net/forwarding/no_forwarding.sh
>> @@ -233,6 +233,9 @@ cleanup()
>>  {
>>  	pre_cleanup
>>  
>> +	ip link set dev $swp2 down
>> +	ip link set dev $swp1 down
>> +
>>  	h2_destroy
>>  	h1_destroy
>>  
>
> no_forwarding always runs in thread 0 because it's the slowest test,
> and we schedule from the slowest first as a basic bin-packing heuristic.
> Clicking through the failures, I don't see them on thread 0.

Is there a way to see what ran before?

> But putting the ports down seems like a good cleanup regardless.

I'll send it as a proper patch.


* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday
From: Jakub Kicinski @ 2024-08-24 21:27 UTC
  To: Petr Machata; +Cc: Nikolay Aleksandrov, Hangbin Liu, netdev@vger.kernel.org

On Fri, 23 Aug 2024 18:13:01 +0200 Petr Machata wrote:
> >> +	ip link set dev $swp2 down
> >> +	ip link set dev $swp1 down
> >> +
> >>  	h2_destroy
> >>  	h1_destroy
> >>    
> >
> > no_forwarding always runs in thread 0 because it's the slowest test,
> > and we schedule from the slowest first as a basic bin-packing heuristic.
> > Clicking through the failures, I don't see them on thread 0.
> 
> Is there a way to see what ran before?

The data is with the outputs in the "info" file, not in the DB :(
I hacked up a bash script to fetch those:
https://github.com/linux-netdev/nipa/blob/main/contest/cithreadmap

Looks like in the failed cases local_termination.sh always runs
before the router-bridge tests, and whatever runs next flakes:

Thread4-VM0
	 5-local-termination-sh/
	 20-router-bridge-lag-sh/
	 20-router-bridge-lag-sh-retry/

Thread4-VM0
	 5-local-termination-sh/
	 16-router-bridge-1d-lag-sh/
	 16-router-bridge-1d-lag-sh-retry/


* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday
From: Petr Machata @ 2024-08-25  9:01 UTC
  To: Jakub Kicinski
  Cc: Petr Machata, Hangbin Liu, netdev@vger.kernel.org,
	Vladimir Oltean, Petr Machata


Jakub Kicinski <kuba@kernel.org> writes:

> On Fri, 23 Aug 2024 18:13:01 +0200 Petr Machata wrote:
>> >> +	ip link set dev $swp2 down
>> >> +	ip link set dev $swp1 down
>> >> +
>> >>  	h2_destroy
>> >>  	h1_destroy
>> >>    
>> >
>> > no_forwarding always runs in thread 0 because it's the slowest test,
>> > and we schedule from the slowest first as a basic bin-packing heuristic.
>> > Clicking through the failures, I don't see them on thread 0.
>> 
>> Is there a way to see what ran before?
>
> The data is with the outputs in the "info" file, not in the DB :(
> I hacked up a bash script to fetch those:
> https://github.com/linux-netdev/nipa/blob/main/contest/cithreadmap

Nice.

> Looks like in the failed cases local_termination.sh always runs
> before the router-bridge tests, and whatever runs next flakes:
>
> Thread4-VM0
> 	 5-local-termination-sh/
> 	 20-router-bridge-lag-sh/
> 	 20-router-bridge-lag-sh-retry/
>
> Thread4-VM0
> 	 5-local-termination-sh/
> 	 16-router-bridge-1d-lag-sh/
> 	 16-router-bridge-1d-lag-sh-retry/

Looks like local_termination has the same issue, cut'n'pasted from
no_forwarding. I'll send a fix on Monday.
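
Presumably it will be the same two lines in local_termination.sh's
cleanup(), mirroring the no_forwarding patch above (a sketch of what I
have in mind, not the final patch):

    cleanup()
    {
        pre_cleanup

        ip link set dev $swp2 down
        ip link set dev $swp1 down

        # ... existing teardown continues as before ...
    }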

