* [TEST] forwarding/router_bridge_lag.sh started to flake on Monday

From: Jakub Kicinski @ 2024-08-22 15:37 UTC
To: Petr Machata, Nikolay Aleksandrov, Hangbin Liu
Cc: netdev@vger.kernel.org

Hi!

Looks like forwarding/router_bridge_lag.sh has gotten a lot more flaky
this week. It flaked very occasionally (and in a different way) before:

https://netdev.bots.linux.dev/contest.html?executor=vmksft-forwarding&test=router-bridge-lag-sh&ld_cnt=250

There doesn't seem to be any obvious commit that could have caused this.
Any ideas?
* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday

From: Petr Machata @ 2024-08-23 11:28 UTC
To: Jakub Kicinski
Cc: Petr Machata, Nikolay Aleksandrov, Hangbin Liu, netdev@vger.kernel.org

Jakub Kicinski <kuba@kernel.org> writes:

> Looks like forwarding/router_bridge_lag.sh has gotten a lot more flaky
> this week. It flaked very occasionally (and in a different way) before:
>
> https://netdev.bots.linux.dev/contest.html?executor=vmksft-forwarding&test=router-bridge-lag-sh&ld_cnt=250
>
> There doesn't seem to be any obvious commit that could have caused this.

Hmm:

# 3.37 [+0.11] Error: Device is up. Set it down before adding it as a team port.

How are the tests isolated? Are they each run in their own vng, or are
instances shared? Could it be that the test that runs before this one
neglects to take a port down?

In one failure case (I don't see further back or my browser would
apparently catch fire) the predecessor was no_forwarding.sh, and indeed
it looks like it raises the ports, but I don't see where it sets them
back down.

Then router-bridge-lag's cleanup downs the ports, and on rerun it
succeeds. The issue would be probabilistic, because no_forwarding does
not always run before this test, and some tests do not care that the
ports are up.

If that's the root cause, this should fix it:

From 0baf91dc24b95ae0cadfdf5db05b74888e6a228a Mon Sep 17 00:00:00 2001
Message-ID: <0baf91dc24b95ae0cadfdf5db05b74888e6a228a.1724413545.git.petrm@nvidia.com>
From: Petr Machata <petrm@nvidia.com>
Date: Fri, 23 Aug 2024 14:42:48 +0300
Subject: [PATCH net-next mlxsw] selftests: forwarding: no_forwarding: Down
 ports on cleanup
To: <nbu-linux-internal@nvidia.com>

This test neglects to put ports down on cleanup. Fix it.
Fixes: 476a4f05d9b8 ("selftests: forwarding: add a no_forwarding.sh test")
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
 tools/testing/selftests/net/forwarding/no_forwarding.sh | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/no_forwarding.sh b/tools/testing/selftests/net/forwarding/no_forwarding.sh
index af3b398d13f0..9e677aa64a06 100755
--- a/tools/testing/selftests/net/forwarding/no_forwarding.sh
+++ b/tools/testing/selftests/net/forwarding/no_forwarding.sh
@@ -233,6 +233,9 @@ cleanup()
 {
	pre_cleanup

+	ip link set dev $swp2 down
+	ip link set dev $swp1 down
+
	h2_destroy
	h1_destroy
-- 
2.45.0
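The convention the patch restores is that cleanup() undoes setup in
reverse order, leaving the ports down for whatever test runs next. A
minimal sketch of that symmetry follows; the port names are
illustrative and ip(8) is stubbed out with a shell function so the
sketch runs unprivileged (the real selftests operate on actual
devices):

```shell
# Sketch of the setup/cleanup symmetry the forwarding selftests rely on.
# $swp1/$swp2 follow the selftest naming convention; "ip" is a stub that
# only echoes the command it would run.
ip() { echo "ip $*"; }
swp1=eth1; swp2=eth2

setup_prepare() {
	ip link set dev $swp1 up
	ip link set dev $swp2 up
}

cleanup() {
	# Undo setup in reverse order so the next test finds the ports down.
	ip link set dev $swp2 down
	ip link set dev $swp1 down
}

setup_prepare
cleanup
```

Without the two down commands in cleanup(), the next test that tries to
enslave $swp1 to a team device hits exactly the "Device is up" error
quoted above.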
* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday

From: Jakub Kicinski @ 2024-08-23 15:02 UTC
To: Petr Machata
Cc: Nikolay Aleksandrov, Hangbin Liu, netdev@vger.kernel.org

On Fri, 23 Aug 2024 13:28:11 +0200 Petr Machata wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
>
> > Looks like forwarding/router_bridge_lag.sh has gotten a lot more flaky
> > this week. It flaked very occasionally (and in a different way) before:
> >
> > https://netdev.bots.linux.dev/contest.html?executor=vmksft-forwarding&test=router-bridge-lag-sh&ld_cnt=250
> >
> > There doesn't seem to be any obvious commit that could have caused this.
>
> Hmm:
> # 3.37 [+0.11] Error: Device is up. Set it down before adding it as a team port.
>
> How are the tests isolated? Are they each run in their own vng, or are
> instances shared? Could it be that the test that runs before this one
> neglects to take a port down?

Yes, each one has its own VM, but the VM is reused for multiple tests
serially. The "info" file shows which VM was used (thr-id identifies
the worker, vm-id identifies the VM within the worker; the worker will
restart the VM if it detects a crash).

> In one failure case (I don't see further back or my browser would
> apparently catch fire) the predecessor was no_forwarding.sh, and indeed
> it looks like it raises the ports, but I don't see where it sets them
> back down.
>
> Then router-bridge-lag's cleanup downs the ports, and on rerun it
> succeeds. The issue would be probabilistic, because no_forwarding does
> not always run before this test, and some tests do not care that the
> ports are up.
> If that's the root cause, this should fix it:
>
> From 0baf91dc24b95ae0cadfdf5db05b74888e6a228a Mon Sep 17 00:00:00 2001
> Message-ID: <0baf91dc24b95ae0cadfdf5db05b74888e6a228a.1724413545.git.petrm@nvidia.com>
> From: Petr Machata <petrm@nvidia.com>
> Date: Fri, 23 Aug 2024 14:42:48 +0300
> Subject: [PATCH net-next mlxsw] selftests: forwarding: no_forwarding: Down
>  ports on cleanup
> To: <nbu-linux-internal@nvidia.com>
>
> This test neglects to put ports down on cleanup. Fix it.
>
> Fixes: 476a4f05d9b8 ("selftests: forwarding: add a no_forwarding.sh test")
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> ---
>  tools/testing/selftests/net/forwarding/no_forwarding.sh | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/tools/testing/selftests/net/forwarding/no_forwarding.sh b/tools/testing/selftests/net/forwarding/no_forwarding.sh
> index af3b398d13f0..9e677aa64a06 100755
> --- a/tools/testing/selftests/net/forwarding/no_forwarding.sh
> +++ b/tools/testing/selftests/net/forwarding/no_forwarding.sh
> @@ -233,6 +233,9 @@ cleanup()
>  {
> 	pre_cleanup
>
> +	ip link set dev $swp2 down
> +	ip link set dev $swp1 down
> +
> 	h2_destroy
> 	h1_destroy

no_forwarding always runs in thread 0 because it's the slowest test,
and we try to run from the slowest as a basic bin-packing heuristic.
Clicking through the failures, I don't see them on thread 0.

But putting the ports down seems like a good cleanup regardless.
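The scheduling described above (longest-processing-time first, dealt
out to workers) can be sketched as below. The durations, the worker
count, and the file layout are invented for illustration; the point is
only that after a descending sort the slowest test always lands on
thread 0:

```shell
# Hypothetical per-test durations, one "<seconds> <test>" pair per line
# (numbers invented for illustration).
durations=$(mktemp)
cat > "$durations" <<'EOF'
212 no_forwarding.sh
180 local_termination.sh
95 router_bridge_lag.sh
40 router_bridge_1d_lag.sh
EOF

# Longest-first: sort by duration descending, then deal the tests out to
# the workers round-robin (two workers here).
sort -rn "$durations" | awk '{ print "thread" (NR - 1) % 2 ": " $2 }'
# -> thread0: no_forwarding.sh
#    thread1: local_termination.sh
#    thread0: router_bridge_lag.sh
#    thread1: router_bridge_1d_lag.sh
```

This is why a no_forwarding leftover could only poison tests that also
run on thread 0, which is the observation that rules it out here.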
* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday

From: Petr Machata @ 2024-08-23 16:13 UTC
To: Jakub Kicinski
Cc: Petr Machata, Nikolay Aleksandrov, Hangbin Liu, netdev@vger.kernel.org

Jakub Kicinski <kuba@kernel.org> writes:

> On Fri, 23 Aug 2024 13:28:11 +0200 Petr Machata wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>>
>> > Looks like forwarding/router_bridge_lag.sh has gotten a lot more flaky
>> > this week. It flaked very occasionally (and in a different way) before:
>> >
>> > https://netdev.bots.linux.dev/contest.html?executor=vmksft-forwarding&test=router-bridge-lag-sh&ld_cnt=250
>> >
>> > There doesn't seem to be any obvious commit that could have caused this.
>>
>> Hmm:
>> # 3.37 [+0.11] Error: Device is up. Set it down before adding it as a team port.
>>
>> How are the tests isolated? Are they each run in their own vng, or are
>> instances shared? Could it be that the test that runs before this one
>> neglects to take a port down?
>
> Yes, each one has its own VM, but the VM is reused for multiple tests
> serially. The "info" file shows which VM was used (thr-id identifies
> the worker, vm-id identifies the VM within the worker; the worker will
> restart the VM if it detects a crash).

OK, so my guess would be that whatever ran before the test forgot to
put the port down.

>> In one failure case (I don't see further back or my browser would
>> apparently catch fire) the predecessor was no_forwarding.sh, and indeed
>> it looks like it raises the ports, but I don't see where it sets them
>> back down.
>>
>> Then router-bridge-lag's cleanup downs the ports, and on rerun it
>> succeeds. The issue would be probabilistic, because no_forwarding does
>> not always run before this test, and some tests do not care that the
>> ports are up.
>> If that's the root cause, this should fix it:
>>
>> From 0baf91dc24b95ae0cadfdf5db05b74888e6a228a Mon Sep 17 00:00:00 2001
>> Message-ID: <0baf91dc24b95ae0cadfdf5db05b74888e6a228a.1724413545.git.petrm@nvidia.com>
>> From: Petr Machata <petrm@nvidia.com>
>> Date: Fri, 23 Aug 2024 14:42:48 +0300
>> Subject: [PATCH net-next mlxsw] selftests: forwarding: no_forwarding: Down
>>  ports on cleanup
>> To: <nbu-linux-internal@nvidia.com>
>>
>> This test neglects to put ports down on cleanup. Fix it.
>>
>> Fixes: 476a4f05d9b8 ("selftests: forwarding: add a no_forwarding.sh test")
>> Signed-off-by: Petr Machata <petrm@nvidia.com>
>> ---
>>  tools/testing/selftests/net/forwarding/no_forwarding.sh | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/tools/testing/selftests/net/forwarding/no_forwarding.sh b/tools/testing/selftests/net/forwarding/no_forwarding.sh
>> index af3b398d13f0..9e677aa64a06 100755
>> --- a/tools/testing/selftests/net/forwarding/no_forwarding.sh
>> +++ b/tools/testing/selftests/net/forwarding/no_forwarding.sh
>> @@ -233,6 +233,9 @@ cleanup()
>>  {
>> 	pre_cleanup
>>
>> +	ip link set dev $swp2 down
>> +	ip link set dev $swp1 down
>> +
>> 	h2_destroy
>> 	h1_destroy
>
> no_forwarding always runs in thread 0 because it's the slowest test,
> and we try to run from the slowest as a basic bin-packing heuristic.
> Clicking through the failures, I don't see them on thread 0.

Is there a way to see what ran before?

> But putting the ports down seems like a good cleanup regardless.

I'll send it as a proper patch.
* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday

From: Jakub Kicinski @ 2024-08-24 21:27 UTC
To: Petr Machata
Cc: Nikolay Aleksandrov, Hangbin Liu, netdev@vger.kernel.org

On Fri, 23 Aug 2024 18:13:01 +0200 Petr Machata wrote:
> >> +	ip link set dev $swp2 down
> >> +	ip link set dev $swp1 down
> >> +
> >> 	h2_destroy
> >> 	h1_destroy
> >
> > no_forwarding always runs in thread 0 because it's the slowest test,
> > and we try to run from the slowest as a basic bin-packing heuristic.
> > Clicking through the failures, I don't see them on thread 0.
>
> Is there a way to see what ran before?

The data is with the outputs in the "info" file, not in the DB :(
I hacked up a bash script to fetch those:

https://github.com/linux-netdev/nipa/blob/main/contest/cithreadmap

Looks like for the failed cases local_termination.sh always runs
before router-bridge, and whatever runs next flakes:

Thread4-VM0
  5-local-termination-sh/
  20-router-bridge-lag-sh/
  20-router-bridge-lag-sh-retry/

Thread4-VM0
  5-local-termination-sh/
  16-router-bridge-1d-lag-sh/
  16-router-bridge-1d-lag-sh-retry/
* Re: [TEST] forwarding/router_bridge_lag.sh started to flake on Monday

From: Petr Machata @ 2024-08-25 9:01 UTC
To: Jakub Kicinski
Cc: Petr Machata, Hangbin Liu, netdev@vger.kernel.org, Vladimir Oltean

Jakub Kicinski <kuba@kernel.org> writes:

> On Fri, 23 Aug 2024 18:13:01 +0200 Petr Machata wrote:
>> >> +	ip link set dev $swp2 down
>> >> +	ip link set dev $swp1 down
>> >> +
>> >> 	h2_destroy
>> >> 	h1_destroy
>> >
>> > no_forwarding always runs in thread 0 because it's the slowest test,
>> > and we try to run from the slowest as a basic bin-packing heuristic.
>> > Clicking through the failures, I don't see them on thread 0.
>>
>> Is there a way to see what ran before?
>
> The data is with the outputs in the "info" file, not in the DB :(
> I hacked up a bash script to fetch those:
> https://github.com/linux-netdev/nipa/blob/main/contest/cithreadmap

Nice.

> Looks like for the failed cases local_termination.sh always runs
> before router-bridge, and whatever runs next flakes:
>
> Thread4-VM0
>   5-local-termination-sh/
>   20-router-bridge-lag-sh/
>   20-router-bridge-lag-sh-retry/
>
> Thread4-VM0
>   5-local-termination-sh/
>   16-router-bridge-1d-lag-sh/
>   16-router-bridge-1d-lag-sh-retry/

Looks like a no_forwarding cut'n'paste issue. I'll send a fix on Monday.
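A cut'n'paste bug like this suggests other tests may have copied the
same setup without the matching teardown. One way to hunt for them is
to flag scripts that bring $swp1 up but never set it back down. The
sketch below demonstrates the idea on two throwaway files rather than
the real selftest tree, so the file names and the exact grep patterns
are invented:

```shell
# Audit sketch: flag scripts that contain "set dev $swp1 up" but no
# matching "set dev $swp1 down". Two throwaway files stand in for the
# real selftests.
dir=$(mktemp -d)
printf 'ip link set dev $swp1 up\nip link set dev $swp1 down\n' > "$dir/good.sh"
printf 'ip link set dev $swp1 up\n' > "$dir/leaky.sh"

leaky=""
for f in "$dir"/*.sh; do
	grep -q 'set dev \$swp1 up' "$f" || continue
	grep -q 'set dev \$swp1 down' "$f" || leaky="$leaky ${f##*/}"
done
echo "leaves ports up:$leaky"
# -> leaves ports up: leaky.sh
rm -rf "$dir"
```

Against the real tree the loop would run over
tools/testing/selftests/net/forwarding/*.sh instead; a test can of
course down the ports via a helper rather than a literal ip invocation,
so hits would still need eyeballing.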