* [TEST] Flake report
@ 2024-05-09 23:09 Jakub Kicinski
2024-05-10 3:24 ` Hangbin Liu
` (6 more replies)
0 siblings, 7 replies; 20+ messages in thread
From: Jakub Kicinski @ 2024-05-09 23:09 UTC (permalink / raw)
To: Florian Westphal, Simon Horman, Hangbin Liu, Jaehee Park,
Petr Machata, Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts
Cc: netdev
Hi!
Feels like the efforts to get rid of flaky tests have slowed down a bit,
so I thought I'd poke people..
Here's the full list:
https://netdev.bots.linux.dev/flakes.html?min-flip=0&pw-y=0
click on test name to get the list of runs and links to outputs.
As a reminder please see these instructions for repro:
https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style
I'll try to tag folks who touched the tests most recently, but please
don't hesitate to chime in.
net
---
arp-ndisc-untracked-subnets-sh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To: Jaehee Park <jhpark1013@gmail.com>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Times out on debug kernels, passes on non-debug.
This is a real timeout, eats full 7200 seconds.
xfrm-policy-sh
~~~~~~~~~~~~~~
To: Hangbin Liu <liuhangbin@gmail.com>
Times out on debug kernels, passes on non-debug.
This is an "inactivity" timeout: the test doesn't print anything
for 900 seconds, so the runner kills it. We can bump the timeout,
but not printing for 15min is bad..
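As a stopgap on the test side, a long silent step could be wrapped so that
it keeps printing while it runs. A minimal sketch, assuming a POSIX shell;
run_with_heartbeat is a hypothetical helper (not something that exists in
the selftests or the runner), and the interval is an arbitrary choice:

```shell
# Hypothetical helper: run a command and print a "still running" line
# every INTERVAL seconds until the command exits.
# Usage: run_with_heartbeat INTERVAL CMD [ARGS...]
run_with_heartbeat() {
    interval=$1
    shift

    # Start the real work in the background.
    "$@" &
    cmd_pid=$!

    # Heartbeat loop; sleep's own output is discarded so a leftover
    # sleep does not keep the runner's pipe open after we kill the loop.
    (
        while sleep "$interval" >/dev/null 2>&1; do
            echo "# still running: $*"
        done
    ) &
    hb_pid=$!

    # Collect the command's exit status, then stop the heartbeat.
    rc=0
    wait "$cmd_pid" || rc=$?
    kill "$hb_pid" 2>/dev/null || true
    wait "$hb_pid" 2>/dev/null || true
    return "$rc"
}
```

For example: run_with_heartbeat 60 ./slow_step.sh (hypothetical script
name). This only hides the silence from the runner; it does not make the
test any faster.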
cmsg-time-sh
~~~~~~~~~~~~
To: Jakub Kicinski <kuba@kernel.org> (forgot I wrote this :D)
Fails randomly.
pmtu-sh
~~~~~~~
To: Simon Horman <horms@kernel.org>
Skipped because it wants full OVS tooling.
forwarding
----------
sch-tbf-ets-sh, sch-tbf-prio-sh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To: Petr Machata <petrm@nvidia.com>
These fail way too often on non-debug kernels :(
Perhaps we can extend the lower bound?
bridge-igmp-sh, bridge-mld-sh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Ido Schimmel <idosch@nvidia.com>
On debug kernels it always fails with:
# TEST: IGMPv3 group 239.10.10.10 exclude timeout [FAIL]
# Entry 192.0.2.21 has blocked flag failed
For MLD:
# TEST: MLDv2 group ff02::cc exclude timeout [FAIL]
# Entry 2001:db8:1::21 has blocked flag failed
vxlan-bridge-1d-sh
~~~~~~~~~~~~~~~~~~
To: Ido Schimmel <idosch@nvidia.com>
Cc: Petr Machata <petrm@nvidia.com>
Flake fails almost always, with some form of "Expected to capture 0
packets, got $X"
mirror-gre-lag-lacp-sh
~~~~~~~~~~~~~~~~~~~~~~
To: Petr Machata <petrm@nvidia.com>
Often fails on debug with:
# TEST: mirror to gretap: LAG first slave (skip_hw) [FAIL]
# Expected to capture 10 packets, got 13.
mirror-gre-vlan-bridge-1q-sh, mirror-gre-bridge-1d-vlan-sh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To: Petr Machata <petrm@nvidia.com>
Same kind of failure as above but less often and both on debug and non-debug.
tc-actions-sh
~~~~~~~~~~~~~
To: Davide Caratti <dcaratti@redhat.com>
It triggers a random unhandled interrupt, somehow (look at stderr).
It's the only test that does that.
mptcp
-----
To: Matthieu Baerts <matttbe@kernel.org>
simult-flows-sh is still quite flaky :(
nf
--
To: Florian Westphal <fw@strlen.de>
These are skipped because of some compatibility issues:
nft-flowtable-sh, bridge-brouter-sh, nft-audit-sh
Please LMK if I need to update the CLI tooling.
Or is this missing kernel config?
* Re: [TEST] Flake report
2024-05-09 23:09 [TEST] Flake report Jakub Kicinski
@ 2024-05-10 3:24 ` Hangbin Liu
2024-05-10 8:35 ` Florian Westphal
` (5 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: Hangbin Liu @ 2024-05-10 3:24 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev
On Thu, May 09, 2024 at 04:09:58PM -0700, Jakub Kicinski wrote:
> Hi!
>
> Feels like the efforts to get rid of flaky tests have slowed down a bit,
> so I thought I'd poke people..
>
> Here's the full list:
> https://netdev.bots.linux.dev/flakes.html?min-flip=0&pw-y=0
> click on test name to get the list of runs and links to outputs.
>
> As a reminder please see these instructions for repro:
> https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style
>
> I'll try to tag folks who touched the tests most recently, but please
> don't hesitate to chime in.
>
>
> net
> ---
>
> arp-ndisc-untracked-subnets-sh
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> To: Jaehee Park <jhpark1013@gmail.com>
> Cc: Hangbin Liu <liuhangbin@gmail.com>
>
> Times out on debug kernels, passes on non-debug.
> This is a real timeout, eats full 7200 seconds.
>
> xfrm-policy-sh
> ~~~~~~~~~~~~~~
> To: Hangbin Liu <liuhangbin@gmail.com>
>
> Times out on debug kernels, passed on non-debug,
> This is a "inactivity" timeout, test doesn't print anything
> for 900 seconds so the runner kills it. We can bump the timeout
> but not printing for 15min is bad..
Got it, I will check these 2 cases.
Thanks
Hangbin
* Re: [TEST] Flake report
2024-05-09 23:09 [TEST] Flake report Jakub Kicinski
2024-05-10 3:24 ` Hangbin Liu
@ 2024-05-10 8:35 ` Florian Westphal
2024-05-10 14:47 ` Jakub Kicinski
2024-05-10 14:28 ` Matthieu Baerts
` (4 subsequent siblings)
6 siblings, 1 reply; 20+ messages in thread
From: Florian Westphal @ 2024-05-10 8:35 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Florian Westphal, Simon Horman, Hangbin Liu, Jaehee Park,
Petr Machata, Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
Jakub Kicinski <kuba@kernel.org> wrote:
> To: Florian Westphal <fw@strlen.de>
>
> These are skipped because of some compatibility issues:
>
> nft-flowtable-sh, bridge-brouter-sh, nft-audit-sh
>
> Please LMK if I need to update the CLI tooling.
> Or is this missing kernel config?
No, it's related to the userspace tooling.
This should start to work once Amazon Linux updates nftables.
bridge-brouter-sh would work with the old ebtables-legacy instead
of ebtables-nft, or a more recent version of ebtables-nft.
ATM it uses a version of ebtables-nft that lacks "broute" table emulation.
* Re: [TEST] Flake report
2024-05-09 23:09 [TEST] Flake report Jakub Kicinski
2024-05-10 3:24 ` Hangbin Liu
2024-05-10 8:35 ` Florian Westphal
@ 2024-05-10 14:28 ` Matthieu Baerts
2024-05-30 17:35 ` Matthieu Baerts
2024-05-10 14:45 ` Nikolay Aleksandrov
` (3 subsequent siblings)
6 siblings, 1 reply; 20+ messages in thread
From: Matthieu Baerts @ 2024-05-10 14:28 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, Florian Westphal, Simon Horman, Hangbin Liu, Jaehee Park,
Petr Machata, Nikolay Aleksandrov, Ido Schimmel, Davide Caratti
Hi Jakub,
Thank you for this reminder!
On 10/05/2024 01:09, Jakub Kicinski wrote:
(...)
> mptcp
> -----
> To: Matthieu Baerts <matttbe@kernel.org>
>
> simult-flows-sh is still quite flaky :(
Yes, we need to find a solution for that. It is not as unstable on our
side [1]. We will look at that next week. If we cannot find a solution
quickly, we will skip the flaky subtests to stop the noise while
continuing to investigate.
[1] https://ci-results.mptcp.dev/flakes.html
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
* Re: [TEST] Flake report
2024-05-09 23:09 [TEST] Flake report Jakub Kicinski
` (2 preceding siblings ...)
2024-05-10 14:28 ` Matthieu Baerts
@ 2024-05-10 14:45 ` Nikolay Aleksandrov
2024-05-11 13:27 ` Simon Horman
` (2 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: Nikolay Aleksandrov @ 2024-05-10 14:45 UTC (permalink / raw)
To: Jakub Kicinski, Florian Westphal, Simon Horman, Hangbin Liu,
Jaehee Park, Petr Machata, Ido Schimmel, Davide Caratti,
Matthieu Baerts
Cc: netdev
On 10/05/2024 02:09, Jakub Kicinski wrote:
> Hi!
>
> Feels like the efforts to get rid of flaky tests have slowed down a bit,
> so I thought I'd poke people..
>
> Here's the full list:
> https://netdev.bots.linux.dev/flakes.html?min-flip=0&pw-y=0
> click on test name to get the list of runs and links to outputs.
>
> As a reminder please see these instructions for repro:
> https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style
>
> I'll try to tag folks who touched the tests most recently, but please
> don't hesitate to chime in.
>
>
[snip]
> bridge-igmp-sh, bridge-mld-sh
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> To: Nikolay Aleksandrov <razor@blackwall.org>
> Cc: Ido Schimmel <idosch@nvidia.com>
>
> On debug kernels it always fails with:
>
> # TEST: IGMPv3 group 239.10.10.10 exclude timeout [FAIL]
> # Entry 192.0.2.21 has blocked flag failed
>
> For MLD:
>
> # TEST: MLDv2 group ff02::cc exclude timeout [FAIL]
> # Entry 2001:db8:1::21 has blocked flag failed
>
I think the problem is the short timeout on slower (debug) runs. Perhaps increasing
it from 3 to 5 seconds would be enough to cover the 2 second waits for setup and
the verifications being done. I'll give it a go and post a patch.
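An alternative to a fixed, larger sleep is to poll for the condition up to
a deadline, so fast runs finish early and slow (debug) runs get the full
budget. A minimal sketch, assuming a POSIX shell; wait_until is a
hypothetical helper and not necessarily what the posted patch will do:

```shell
# Hypothetical helper: re-run CMD once per second until it succeeds or
# TIMEOUT_SEC seconds have passed.
# Usage: wait_until TIMEOUT_SEC CMD [ARGS...]
wait_until() {
    timeout_s=$1
    shift
    deadline=$(( $(date +%s) + timeout_s ))

    while ! "$@"; do
        # Give up once the deadline is reached.
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}
```

With a 5 second budget and 2 seconds already spent on setup, this leaves
about 3 seconds for the state to settle, versus about 1 second with the
current 3 second timeout.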
* Re: [TEST] Flake report
2024-05-10 8:35 ` Florian Westphal
@ 2024-05-10 14:47 ` Jakub Kicinski
2024-05-10 16:03 ` Jakub Kicinski
0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2024-05-10 14:47 UTC (permalink / raw)
To: Florian Westphal
Cc: Simon Horman, Hangbin Liu, Jaehee Park, Petr Machata,
Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
On Fri, 10 May 2024 10:35:51 +0200 Florian Westphal wrote:
> Jakub Kicinski <kuba@kernel.org> wrote:
> > To: Florian Westphal <fw@strlen.de>
> >
> > These are skipped because of some compatibility issues:
> >
> > nft-flowtable-sh, bridge-brouter-sh, nft-audit-sh
> >
> > Please LMK if I need to update the CLI tooling.
> > Or is this missing kernel config?
>
> No, its related to the userspace tooling.
> This should start to work once amazon linux updates nftables.
>
> bridge-brouter-sh would work with the old ebtables-legacy instead
> of ebtables-nft, or a more recent version of ebtables-nft.
>
> ATM it uses a version of ebtables-nft that lacks "broute" table emulation.
Amazon Linux is more of a base OS for loading containers it seems.
I build pretty much all the tools from source.
So I just built nft too.. Whether it will actually work we'll find
out in about 15 min :)
* Re: [TEST] Flake report
2024-05-10 14:47 ` Jakub Kicinski
@ 2024-05-10 16:03 ` Jakub Kicinski
2024-05-10 16:41 ` Florian Westphal
0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2024-05-10 16:03 UTC (permalink / raw)
To: Florian Westphal
Cc: Simon Horman, Hangbin Liu, Jaehee Park, Petr Machata,
Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
On Fri, 10 May 2024 07:47:16 -0700 Jakub Kicinski wrote:
> On Fri, 10 May 2024 10:35:51 +0200 Florian Westphal wrote:
> > Jakub Kicinski <kuba@kernel.org> wrote:
> > > To: Florian Westphal <fw@strlen.de>
> > >
> > > These are skipped because of some compatibility issues:
> > >
> > > nft-flowtable-sh, bridge-brouter-sh, nft-audit-sh
> > >
> > > Please LMK if I need to update the CLI tooling.
> > > Or is this missing kernel config?
> >
> > No, its related to the userspace tooling.
> > This should start to work once amazon linux updates nftables.
> >
> > bridge-brouter-sh would work with the old ebtables-legacy instead
> > of ebtables-nft, or a more recent version of ebtables-nft.
> >
> > ATM it uses a version of ebtables-nft that lacks "broute" table emulation.
>
> Amazon Linux is more of a base OS for loading containers it seems.
> I build pretty much all the tools from source.
>
> So I just built nft too.. Whether it will actually work we'll find
> out in about 15 min :)
Hm. Looks like that didn't do anything.
I tried to investigate nft_audit.sh
https://netdev-3.bots.linux.dev/vmksft-nf/results/589221/22-nft-audit-sh/stdout
# selftests: net/netfilter: nft_audit.sh
# SKIP: nft reset feature test failed: nftables v1.0.9 (Old Doc Yak #3)
ok 1 selftests: net/netfilter: nft_audit.sh # SKIP
This is what it hits:
bash-5.2# nft -v
nftables v1.0.9 (Old Doc Yak #3)
bash-5.2# nft --check -f /dev/stdin <<EOF
add table t
add chain t c
reset rules t c
EOF
/dev/stdin:3:7-11: Error: syntax error, unexpected string, expecting counter or counters or quotas or quota
reset rules t c
^^^^^
What does that mean in lay terms?
Question #2, for the ebtables test - do I need to build iptables?
I built nft with
./configure --with-json --with-xtables
but no xtables-nft-multi popped out.
* Re: [TEST] Flake report
2024-05-10 16:03 ` Jakub Kicinski
@ 2024-05-10 16:41 ` Florian Westphal
2024-05-10 18:02 ` Jakub Kicinski
0 siblings, 1 reply; 20+ messages in thread
From: Florian Westphal @ 2024-05-10 16:41 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Florian Westphal, Simon Horman, Hangbin Liu, Jaehee Park,
Petr Machata, Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
Jakub Kicinski <kuba@kernel.org> wrote:
> M. Looks like that didn't do anything.
>
> I tried to investigate nft_audit.sh
>
> https://netdev-3.bots.linux.dev/vmksft-nf/results/589221/22-nft-audit-sh/stdout
>
> # selftests: net/netfilter: nft_audit.sh
> # SKIP: nft reset feature test failed: nftables v1.0.9 (Old Doc Yak #3)
> ok 1 selftests: net/netfilter: nft_audit.sh # SKIP
>
> This is what it hits:
>
> bash-5.2# nft -v
> nftables v1.0.9 (Old Doc Yak #3)
> bash-5.2# nft --check -f /dev/stdin <<EOF
> add table t
> add chain t c
> reset rules t c
> EOF
> /dev/stdin:3:7-11: Error: syntax error, unexpected string, expecting counter or counters or quotas or quota
> reset rules t c
> ^^^^^
>
> What does that mean in lay terms?
This nft version chokes on syntax, but I cannot reproduce this:
src/nft --check -f /dev/stdin <<EOF
add table t
add chain t c
reset rules t c
EOF
echo $?
table ip t {
chain c { }
}
0
src/nft --version
nftables v1.0.9 (Old Doc Yak #3)
No idea :-(
I tried building both recent nftables.git and v1.0.9 tag and both
parse the test file for me :-(
Also, nft-flowtable.sh is still not working on nf infra even
with the updated version, while that script works fine locally for me
as well, even when running via vng.
Maybe there is an old libnftables on the system that is used
instead for parsing? It's bundled/installed with nftables, can
you check that ldd nft doesn't show some other distro-installed
version? Other than that I have no idea what could be happening here.
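The ldd check could look roughly like this. A sketch only:
resolved_libnftables is a hypothetical helper name, and it assumes the
usual "name => path (address)" layout of ldd output:

```shell
# Hypothetical helper: read ldd output on stdin and print the path that
# libnftables resolves to.
resolved_libnftables() {
    awk '$1 ~ /^libnftables\.so/ && $2 == "=>" { print $3 }'
}

# Only run the check if an nft binary is actually on PATH.
if command -v nft >/dev/null 2>&1; then
    ldd "$(command -v nft)" | resolved_libnftables
fi
```

If the printed path points at a distro directory rather than the prefix
the new nftables was installed into, the old library is the one doing the
parsing.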
> Question #2, for the ebtables test - do I need to build iptables?
> I built nft with
> ./configure --with-json --with-xtables
You need to add --enable-nftables for ebtables-nft, or you need to
use the old ebtables tree, i.e.:
https://git.netfilter.org/ebtables/
both should work.
* Re: [TEST] Flake report
2024-05-10 16:41 ` Florian Westphal
@ 2024-05-10 18:02 ` Jakub Kicinski
2024-05-11 0:14 ` Jakub Kicinski
0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2024-05-10 18:02 UTC (permalink / raw)
To: Florian Westphal
Cc: Simon Horman, Hangbin Liu, Jaehee Park, Petr Machata,
Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
On Fri, 10 May 2024 18:41:47 +0200 Florian Westphal wrote:
> > What does that mean in lay terms?
>
> This nft version chokes on syntax, but I cannot reproduce this:
> src/nft --check -f /dev/stdin <<EOF
> add table t
> add chain t c
> reset rules t c
> EOF
> echo $?
> table ip t {
> chain c { }
> }
> 0
> src/nft --version
> nftables v1.0.9 (Old Doc Yak #3)
>
> No idea :-(
>
> I tried building both recent nftables.git and v1.0.9 tag and both
> parse the test file for me :-(
>
> Also. nft-flowtable.sh is still not working on nf infra even
> with the updated version while that script works fine locally for me
> as well, even with running via vng.
>
> Maybe there is an old libnftables on the system that is used
> instead for parsing? Its bundled/installed with nftables, can
> you check that ldd nft doesnt show some other distro-installed
> version? Other than that I have no idea what could be happening here.
Good call! The LD_LIBRARY_PATH was including things in the wrong order.
I changed that for the next run.
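For reference, the dynamic loader searches the LD_LIBRARY_PATH directories
left to right, so the directory holding the freshly built libnftables has
to come before any distro one. A minimal sketch; prepend_lib_dir is a
hypothetical helper and /opt/nft/lib is a made-up install prefix, not the
path used on the CI machines:

```shell
# Hypothetical helper: put DIR at the front of LD_LIBRARY_PATH so that
# libraries in DIR win over same-named ones later in the list.
prepend_lib_dir() {
    LD_LIBRARY_PATH=$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    export LD_LIBRARY_PATH
}

# Example with an assumed prefix:
# prepend_lib_dir /opt/nft/lib
```

The ${VAR:+...} expansion avoids a trailing ":" when the variable was
empty; an empty entry would make the loader search the current directory.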
> > Question #2, for the ebtables test - do I need to build iptables?
> > I built nft with
> > ./configure --with-json --with-xtables
>
> You need to add --enable-nftables for ebtables-nft, or you need to
> use the old ebtables tree, i.e.:
> https://git.netfilter.org/ebtables/
>
> both should work.
Picked the old tree. Let's see..
* Re: [TEST] Flake report
2024-05-10 18:02 ` Jakub Kicinski
@ 2024-05-11 0:14 ` Jakub Kicinski
2024-05-11 6:50 ` Florian Westphal
0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2024-05-11 0:14 UTC (permalink / raw)
To: Florian Westphal
Cc: Simon Horman, Hangbin Liu, Jaehee Park, Petr Machata,
Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
On Fri, 10 May 2024 11:02:43 -0700 Jakub Kicinski wrote:
> > You need to add --enable-nftables for ebtables-nft, or you need to
> > use the old ebtables tree, i.e.:
> > https://git.netfilter.org/ebtables/
> >
> > both should work.
>
> Picked the old tree. Let's see..
Looks like that worked!!
So the last fail we see for netfilter is nft-flowtable-sh with kernel
debug enabled:
https://netdev.bots.linux.dev/contest.html?executor=vmksft-nf-dbg&test=nft-flowtable-sh
* Re: [TEST] Flake report
2024-05-11 0:14 ` Jakub Kicinski
@ 2024-05-11 6:50 ` Florian Westphal
0 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2024-05-11 6:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Florian Westphal, Simon Horman, Hangbin Liu, Jaehee Park,
Petr Machata, Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
Jakub Kicinski <kuba@kernel.org> wrote:
> > Picked the old tree. Let's see..
>
> Looks like that worked!!
Great, thanks a lot!
> So the last fail we see for netfilter is nft-flowtable-sh with kernel
> debug enabled:
>
> https://netdev.bots.linux.dev/contest.html?executor=vmksft-nf-dbg&test=nft-flowtable-sh
I'd guess socat gets killed off before it hits EOF, I sent a patch to bump the
timeout to 1m. Let's see if that's enough to make it fly.
* Re: [TEST] Flake report
2024-05-09 23:09 [TEST] Flake report Jakub Kicinski
` (3 preceding siblings ...)
2024-05-10 14:45 ` Nikolay Aleksandrov
@ 2024-05-11 13:27 ` Simon Horman
2024-05-14 13:52 ` Aaron Conole
2024-05-13 11:58 ` Davide Caratti
2024-05-13 16:52 ` Petr Machata
6 siblings, 1 reply; 20+ messages in thread
From: Simon Horman @ 2024-05-11 13:27 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Florian Westphal, Hangbin Liu, Jaehee Park, Petr Machata,
Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev, Aaron Conole
+ Aaron
On Thu, May 09, 2024 at 04:09:58PM -0700, Jakub Kicinski wrote:
> Hi!
>
> Feels like the efforts to get rid of flaky tests have slowed down a bit,
> so I thought I'd poke people..
>
> Here's the full list:
> https://netdev.bots.linux.dev/flakes.html?min-flip=0&pw-y=0
> click on test name to get the list of runs and links to outputs.
>
> As a reminder please see these instructions for repro:
> https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style
>
> I'll try to tag folks who touched the tests most recently, but please
> don't hesitate to chime in.
>
>
> net
> ---
>
> arp-ndisc-untracked-subnets-sh
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> To: Jaehee Park <jhpark1013@gmail.com>
> Cc: Hangbin Liu <liuhangbin@gmail.com>
>
> Times out on debug kernels, passes on non-debug.
> This is a real timeout, eats full 7200 seconds.
>
> xfrm-policy-sh
> ~~~~~~~~~~~~~~
> To: Hangbin Liu <liuhangbin@gmail.com>
>
> Times out on debug kernels, passed on non-debug,
> This is a "inactivity" timeout, test doesn't print anything
> for 900 seconds so the runner kills it. We can bump the timeout
> but not printing for 15min is bad..
>
> cmsg-time-sh
> ~~~~~~~~~~~~
> To: Jakub Kicinski <kuba@kernel.org> (forgot I wrote this :D)
>
> Fails randomly.
>
> pmtu-sh
> ~~~~~~~
> To: Simon Horman <horms@kernel.org>
>
> Skipped because it wants full OVS tooling.
My understanding is that Aaron (CCed) is working on addressing
this problem by allowing the test to run without full OVS tooling.
> forwarding
> ----------
>
> sch-tbf-ets-sh, sch-tbf-prio-sh
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> To: Petr Machata <petrm@nvidia.com>
>
> These fail way too often on non-debug kernels :(
> Perhaps we can extend the lower bound?
>
> bridge-igmp-sh, bridge-mld-sh
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> To: Nikolay Aleksandrov <razor@blackwall.org>
> Cc: Ido Schimmel <idosch@nvidia.com>
>
> On debug kernels it always fails with:
>
> # TEST: IGMPv3 group 239.10.10.10 exclude timeout [FAIL]
> # Entry 192.0.2.21 has blocked flag failed
>
> For MLD:
>
> # TEST: MLDv2 group ff02::cc exclude timeout [FAIL]
> # Entry 2001:db8:1::21 has blocked flag failed
>
> vxlan-bridge-1d-sh
> ~~~~~~~~~~~~~~~~~~
> To: Ido Schimmel <idosch@nvidia.com>
> Cc: Petr Machata <petrm@nvidia.com>
>
> Flake fails almost always, with some form of "Expected to capture 0
> packets, got $X"
>
> mirror-gre-lag-lacp-sh
> ~~~~~~~~~~~~~~~~~~~~~~
> To: Petr Machata <petrm@nvidia.com>
>
> Often fails on debug with:
>
> # TEST: mirror to gretap: LAG first slave (skip_hw) [FAIL]
> # Expected to capture 10 packets, got 13.
>
> mirror-gre-vlan-bridge-1q-sh, mirror-gre-bridge-1d-vlan-sh
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> To: Petr Machata <petrm@nvidia.com>
>
> Same kind of failure as above but less often and both on debug and non-debug.
>
> tc-actions-sh
> ~~~~~~~~~~~~~
> To: Davide Caratti <dcaratti@redhat.com>
>
> It triggers a random unhandled interrupt, somehow (look at stderr).
> It's the only test that does that.
>
>
> mptcp
> -----
> To: Matthieu Baerts <matttbe@kernel.org>
>
> simult-flows-sh is still quite flaky :(
>
>
> nf
> --
> To: Florian Westphal <fw@strlen.de>
>
> These are skipped because of some compatibility issues:
>
> nft-flowtable-sh, bridge-brouter-sh, nft-audit-sh
>
> Please LMK if I need to update the CLI tooling.
> Or is this missing kernel config?
>
* Re: [TEST] Flake report
2024-05-09 23:09 [TEST] Flake report Jakub Kicinski
` (4 preceding siblings ...)
2024-05-11 13:27 ` Simon Horman
@ 2024-05-13 11:58 ` Davide Caratti
2024-05-13 16:52 ` Petr Machata
6 siblings, 0 replies; 20+ messages in thread
From: Davide Caratti @ 2024-05-13 11:58 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Florian Westphal, Simon Horman, Hangbin Liu, Jaehee Park,
Petr Machata, Nikolay Aleksandrov, Ido Schimmel, Matthieu Baerts,
netdev
On Thu, May 09, 2024 at 04:09:58PM -0700, Jakub Kicinski wrote:
> Hi!
>
> tc-actions-sh
> ~~~~~~~~~~~~~
> To: Davide Caratti <dcaratti@redhat.com>
>
> It triggers a random unhandled interrupt, somehow (look at stderr).
> It's the only test that does that.
wow, no idea why it produces this. I'll try to reproduce and let you know.
thanks,
--
davide
* Re: [TEST] Flake report
2024-05-09 23:09 [TEST] Flake report Jakub Kicinski
` (5 preceding siblings ...)
2024-05-13 11:58 ` Davide Caratti
@ 2024-05-13 16:52 ` Petr Machata
2024-05-14 13:43 ` Jakub Kicinski
2024-05-21 16:29 ` Petr Machata
6 siblings, 2 replies; 20+ messages in thread
From: Petr Machata @ 2024-05-13 16:52 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Florian Westphal, Simon Horman, Hangbin Liu, Jaehee Park,
Petr Machata, Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
Jakub Kicinski <kuba@kernel.org> writes:
> sch-tbf-ets-sh, sch-tbf-prio-sh
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> To: Petr Machata <petrm@nvidia.com>
>
> These fail way too often on non-debug kernels :(
> Perhaps we can extend the lower bound?
Hm, it sometimes goes even below -10%. It looks like we'd need to go as
low as -15%.
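To make the bound concrete, here is a sketch of a relative-tolerance check
in integer shell arithmetic. The function name, the sample numbers and the
+5% upper bound are illustrative assumptions, not the values the tests
actually use:

```shell
# Hypothetical helper: succeed if MEASURED is within [LO_PCT, HI_PCT]
# percent of EXPECTED. Both sides are scaled by EXPECTED instead of
# dividing, so nothing is lost to integer truncation.
# Usage: rate_within MEASURED EXPECTED LO_PCT HI_PCT
rate_within() {
    measured=$1
    expected=$2
    lo_pct=$3
    hi_pct=$4

    delta=$(( (measured - expected) * 100 ))
    [ "$delta" -ge $(( lo_pct * expected )) ] &&
        [ "$delta" -le $(( hi_pct * expected )) ]
}
```

With these made-up numbers, a run that measures 880 against an expected
1000 (-12%) fails a -10% lower bound but passes a -15% one.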
> vxlan-bridge-1d-sh
> ~~~~~~~~~~~~~~~~~~
> To: Ido Schimmel <idosch@nvidia.com>
> Cc: Petr Machata <petrm@nvidia.com>
>
> Flake fails almost always, with some form of "Expected to capture 0
> packets, got $X"
>
> mirror-gre-lag-lacp-sh
> ~~~~~~~~~~~~~~~~~~~~~~
> To: Petr Machata <petrm@nvidia.com>
>
> Often fails on debug with:
>
> # TEST: mirror to gretap: LAG first slave (skip_hw) [FAIL]
> # Expected to capture 10 packets, got 13.
>
> mirror-gre-vlan-bridge-1q-sh, mirror-gre-bridge-1d-vlan-sh
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> To: Petr Machata <petrm@nvidia.com>
>
> Same kind of failure as above but less often and both on debug and non-debug.
I'll look into these.
* Re: [TEST] Flake report
2024-05-13 16:52 ` Petr Machata
@ 2024-05-14 13:43 ` Jakub Kicinski
2024-05-21 16:29 ` Petr Machata
1 sibling, 0 replies; 20+ messages in thread
From: Jakub Kicinski @ 2024-05-14 13:43 UTC (permalink / raw)
To: Petr Machata
Cc: Florian Westphal, Simon Horman, Hangbin Liu, Jaehee Park,
Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
On Mon, 13 May 2024 18:52:25 +0200 Petr Machata wrote:
> > sch-tbf-ets-sh, sch-tbf-prio-sh
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > To: Petr Machata <petrm@nvidia.com>
> >
> > These fail way too often on non-debug kernels :(
> > Perhaps we can extend the lower bound?
>
> Hm, it sometimes goes even below -10%. It looks like we'd need to go as
> low as -15%.
A more crazy idea would be to run a low prio stress program while
the test is running. I'm guessing that perf is low because VM gets
scheduled out and doesn't get scheduled in in time. Or we can try
to increase the burst size in TBF?
* Re: [TEST] Flake report
2024-05-11 13:27 ` Simon Horman
@ 2024-05-14 13:52 ` Aaron Conole
0 siblings, 0 replies; 20+ messages in thread
From: Aaron Conole @ 2024-05-14 13:52 UTC (permalink / raw)
To: Simon Horman
Cc: Jakub Kicinski, Florian Westphal, Hangbin Liu, Jaehee Park,
Petr Machata, Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
Simon Horman <horms@kernel.org> writes:
> + Aaron
>
> On Thu, May 09, 2024 at 04:09:58PM -0700, Jakub Kicinski wrote:
>> Hi!
>>
>> Feels like the efforts to get rid of flaky tests have slowed down a bit,
>> so I thought I'd poke people..
>>
>> Here's the full list:
>> https://netdev.bots.linux.dev/flakes.html?min-flip=0&pw-y=0
>> click on test name to get the list of runs and links to outputs.
>>
>> As a reminder please see these instructions for repro:
>> https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style
>>
>> I'll try to tag folks who touched the tests most recently, but please
>> don't hesitate to chime in.
>>
>>
>> net
>> ---
>>
>> arp-ndisc-untracked-subnets-sh
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> To: Jaehee Park <jhpark1013@gmail.com>
>> Cc: Hangbin Liu <liuhangbin@gmail.com>
>>
>> Times out on debug kernels, passes on non-debug.
>> This is a real timeout, eats full 7200 seconds.
>>
>> xfrm-policy-sh
>> ~~~~~~~~~~~~~~
>> To: Hangbin Liu <liuhangbin@gmail.com>
>>
>> Times out on debug kernels, passed on non-debug,
>> This is a "inactivity" timeout, test doesn't print anything
>> for 900 seconds so the runner kills it. We can bump the timeout
>> but not printing for 15min is bad..
>>
>> cmsg-time-sh
>> ~~~~~~~~~~~~
>> To: Jakub Kicinski <kuba@kernel.org> (forgot I wrote this :D)
>>
>> Fails randomly.
>>
>> pmtu-sh
>> ~~~~~~~
>> To: Simon Horman <horms@kernel.org>
>>
>> Skipped because it wants full OVS tooling.
>
> My understanding is that Aaron (CCed) is working on addressing
> this problem by allowing the test to run without full OVS tooling.
Yes.
>> forwarding
>> ----------
>>
>> sch-tbf-ets-sh, sch-tbf-prio-sh
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> To: Petr Machata <petrm@nvidia.com>
>>
>> These fail way too often on non-debug kernels :(
>> Perhaps we can extend the lower bound?
>>
>> bridge-igmp-sh, bridge-mld-sh
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> To: Nikolay Aleksandrov <razor@blackwall.org>
>> Cc: Ido Schimmel <idosch@nvidia.com>
>>
>> On debug kernels it always fails with:
>>
>> # TEST: IGMPv3 group 239.10.10.10 exclude timeout [FAIL]
>> # Entry 192.0.2.21 has blocked flag failed
>>
>> For MLD:
>>
>> # TEST: MLDv2 group ff02::cc exclude timeout [FAIL]
>> # Entry 2001:db8:1::21 has blocked flag failed
>>
>> vxlan-bridge-1d-sh
>> ~~~~~~~~~~~~~~~~~~
>> To: Ido Schimmel <idosch@nvidia.com>
>> Cc: Petr Machata <petrm@nvidia.com>
>>
>> Flake fails almost always, with some form of "Expected to capture 0
>> packets, got $X"
>>
>> mirror-gre-lag-lacp-sh
>> ~~~~~~~~~~~~~~~~~~~~~~
>> To: Petr Machata <petrm@nvidia.com>
>>
>> Often fails on debug with:
>>
>> # TEST: mirror to gretap: LAG first slave (skip_hw) [FAIL]
>> # Expected to capture 10 packets, got 13.
>>
>> mirror-gre-vlan-bridge-1q-sh, mirror-gre-bridge-1d-vlan-sh
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> To: Petr Machata <petrm@nvidia.com>
>>
>> Same kind of failure as above but less often and both on debug and non-debug.
>>
>> tc-actions-sh
>> ~~~~~~~~~~~~~
>> To: Davide Caratti <dcaratti@redhat.com>
>>
>> It triggers a random unhandled interrupt, somehow (look at stderr).
>> It's the only test that does that.
>>
>>
>> mptcp
>> -----
>> To: Matthieu Baerts <matttbe@kernel.org>
>>
>> simult-flows-sh is still quite flaky :(
>>
>>
>> nf
>> --
>> To: Florian Westphal <fw@strlen.de>
>>
>> These are skipped because of some compatibility issues:
>>
>> nft-flowtable-sh, bridge-brouter-sh, nft-audit-sh
>>
>> Please LMK if I need to update the CLI tooling.
>> Or is this missing kernel config?
>>
* Re: [TEST] Flake report
2024-05-13 16:52 ` Petr Machata
2024-05-14 13:43 ` Jakub Kicinski
@ 2024-05-21 16:29 ` Petr Machata
1 sibling, 0 replies; 20+ messages in thread
From: Petr Machata @ 2024-05-21 16:29 UTC (permalink / raw)
To: Petr Machata
Cc: Jakub Kicinski, Florian Westphal, Simon Horman, Hangbin Liu,
Jaehee Park, Nikolay Aleksandrov, Ido Schimmel, Davide Caratti,
Matthieu Baerts, netdev
Petr Machata <petrm@nvidia.com> writes:
> Jakub Kicinski <kuba@kernel.org> writes:
>
>> sch-tbf-ets-sh, sch-tbf-prio-sh
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> To: Petr Machata <petrm@nvidia.com>
>>
>> These fail way too often on non-debug kernels :(
>> Perhaps we can extend the lower bound?
>
> Hm, it sometimes goes even below -10%. It looks like we'd need to go as
> low as -15%.
>
>> vxlan-bridge-1d-sh
>> ~~~~~~~~~~~~~~~~~~
>> To: Ido Schimmel <idosch@nvidia.com>
>> Cc: Petr Machata <petrm@nvidia.com>
>>
>> Flake fails almost always, with some form of "Expected to capture 0
>> packets, got $X"
>>
>> mirror-gre-lag-lacp-sh
>> ~~~~~~~~~~~~~~~~~~~~~~
>> To: Petr Machata <petrm@nvidia.com>
>>
>> Often fails on debug with:
>>
>> # TEST: mirror to gretap: LAG first slave (skip_hw) [FAIL]
>> # Expected to capture 10 packets, got 13.
>>
>> mirror-gre-vlan-bridge-1q-sh, mirror-gre-bridge-1d-vlan-sh
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> To: Petr Machata <petrm@nvidia.com>
>>
>> Same kind of failure as above but less often and both on debug and non-debug.
>
> I'll look into these.
I fixed mirror-gre-lag-lacp, but the whole mirroring suite is a glorious
mess. In ancient past it used to use ping. These days it uses MZ, but it
still relies on the fact that ICMP packets are sent, including responses
these elicit, and makes all sorts of assumptions around that. And then a
router advertisement comes along and throws the counting out the window.
I'm moving it all over to UDP, but it'll take a bit.
* Re: [TEST] Flake report
2024-05-10 14:28 ` Matthieu Baerts
@ 2024-05-30 17:35 ` Matthieu Baerts
2024-05-30 17:41 ` Jakub Kicinski
0 siblings, 1 reply; 20+ messages in thread
From: Matthieu Baerts @ 2024-05-30 17:35 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev, MPTCP Upstream
Hi Jakub,
(+ MPTCP ML, - authors of other unstable selftests)
On 10/05/2024 16:28, Matthieu Baerts wrote:
> On 10/05/2024 01:09, Jakub Kicinski wrote:
>
> (...)
>
>> mptcp
>> -----
>> To: Matthieu Baerts <matttbe@kernel.org>
>>
>> simult-flows-sh is still quite flaky :(
>
> Yes, we need to find a solution for that. It is not as unstable on our
> side [1]. We will look at that next week. If we cannot find a solution
> quickly, we will skip the flaky subtests to stop the noise while
> continuing to investigate.
Now that the flaky MPTCP subtest results have been ignored, the results
look better:
https://netdev.bots.linux.dev/flakes.html?br-cnt=88&tn-needle=mptcp
Do you think we could also stop ignoring them on the NIPA side?
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
* Re: [TEST] Flake report
2024-05-30 17:35 ` Matthieu Baerts
@ 2024-05-30 17:41 ` Jakub Kicinski
2024-05-31 7:52 ` Matthieu Baerts
0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2024-05-30 17:41 UTC (permalink / raw)
To: Matthieu Baerts; +Cc: netdev, MPTCP Upstream
On Thu, 30 May 2024 19:35:56 +0200 Matthieu Baerts wrote:
> > Yes, we need to find a solution for that. It is not as unstable on our
> > side [1]. We will look at that next week. If we cannot find a solution
> > quickly, we will skip the flaky subtests to stop the noise while
> > continuing to investigate.
> Now that the flaky MPTCP subtest results have been ignored, the results
> look better:
>
> https://netdev.bots.linux.dev/flakes.html?br-cnt=88&tn-needle=mptcp
>
> Do you think we could also stop ignoring them on the NIPA side?
Thanks for taking care of it, done!
* Re: [TEST] Flake report
2024-05-30 17:41 ` Jakub Kicinski
@ 2024-05-31 7:52 ` Matthieu Baerts
0 siblings, 0 replies; 20+ messages in thread
From: Matthieu Baerts @ 2024-05-31 7:52 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev, MPTCP Upstream
Hi Jakub,
On 30/05/2024 19:41, Jakub Kicinski wrote:
> On Thu, 30 May 2024 19:35:56 +0200 Matthieu Baerts wrote:
>>> Yes, we need to find a solution for that. It is not as unstable on our
>>> side [1]. We will look at that next week. If we cannot find a solution
>>> quickly, we will skip the flaky subtests to stop the noise while
>>> continuing to investigate.
>> Now that the flaky MPTCP subtest results have been ignored, the results
>> look better:
>>
>> https://netdev.bots.linux.dev/flakes.html?br-cnt=88&tn-needle=mptcp
>>
>> Do you think we could also stop ignoring them on the NIPA side?
>
> Thanks for taking care of it, done!
Thank you for the modification!
That could have been predicted: just after they were removed from the
list, these tests appeared to be unstable again! But I guess that is
because there is a conflict between -net and net-next [1], and the
"fixes" that are currently only in -net are no longer included in what
is being validated.
Sorry for the trouble! :)
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.