* [TEST] bond_macvlan_ipvlan.sh flakiness
From: Jakub Kicinski @ 2025-11-14 16:20 UTC (permalink / raw)
To: Hangbin Liu; +Cc: netdev
Hi Hangbin!
The flakiness of bond_macvlan_ipvlan.sh has increased quite a lot
recently. Not sure if there's any correlation with kernel changes,
I didn't spot anything in bonding itself. Here's the history of runs:
https://netdev.bots.linux.dev/contest.html?executor=vmksft-bonding&test=bond-macvlan-ipvlan-sh
It looks like it's gotten much worse starting around the 9th?
Only the non-debug kernel build is flaking, debug builds are completely
clear:
https://netdev.bots.linux.dev/flakes.html?min-flip=0&tn-needle=bond-macvlan-ipvlan-sh
A few things that stood out to me, all the failures are like this:
# TEST: balance-$lb/$$$vlan_bridge: IPv4: client->$$$vlan_2 [FAIL]
Always IPv4 ping to the second interface, always fails neighbor
resolution:
# 192.0.2.12 dev eth0 FAILED
If it's ipvlan that fails rather than macvlan there is a bunch of
otherhost drops:
# 17: ipvlan0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
#     link/ether 00:0a:0b:0c:0d:01 brd ff:ff:ff:ff:ff:ff link-netns s-8BLcCn
#     RX:  bytes packets errors dropped  missed   mcast
#            702      10      0       0       0       3
#     RX errors:  length    crc   frame    fifo overrun otherhost
#                      0      0       0       0       0         4
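To compare the otherhost counter across runs, a minimal sketch like the one below could pull it out of output shaped like the dump above. The helper name otherhost_drops is hypothetical, and it assumes the two-line "RX errors:" layout shown here, where the counter is the last column of the line following the header.

```shell
# Hypothetical helper: read link statistics on stdin and print the
# "otherhost" counter, i.e. the last field of the line that follows the
# "RX errors: ... otherhost" header. Works with or without the "# " log prefix.
otherhost_drops()
{
	awk '/RX errors:/ && /otherhost/ {
		if ((getline) > 0)
			print $NF
		exit
	}'
}

# Assumed usage, with the interface name taken from the dump above:
#   ip -s -s link show dev ipvlan0 | otherhost_drops
```

A non-zero value here only says that frames arrived with a destination MAC the device did not consider its own; it does not by itself explain why.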
FWIW here's the contents of the branches if you want to look thru:
https://netdev.bots.linux.dev/static/nipa/branch_deltas/net-next-2025-11-09--12-00.html
but 9th was the weekend, and the failure just got more frequent,
we've been trying to track this down for a while..
* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
From: Hangbin Liu @ 2025-11-17 8:24 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev
On Fri, Nov 14, 2025 at 08:20:14AM -0800, Jakub Kicinski wrote:
> Hi Hangbin!
>
> The flakiness of bond_macvlan_ipvlan.sh has increased quite a lot
> recently. Not sure if there's any correlation with kernel changes,
> I didn't spot anything in bonding itself. Here's the history of runs:
>
> https://netdev.bots.linux.dev/contest.html?executor=vmksft-bonding&test=bond-macvlan-ipvlan-sh
>
> It looks like it's gotten much worse starting around the 9th?
>
> Only the non-debug kernel build is flaking, debug builds are completely
> clear:
>
> https://netdev.bots.linux.dev/flakes.html?min-flip=0&tn-needle=bond-macvlan-ipvlan-sh
>
> A few things that stood out to me, all the failures are like this:
>
> # TEST: balance-$lb/$$$vlan_bridge: IPv4: client->$$$vlan_2 [FAIL]
>
> Always IPv4 ping to the second interface, always fails neighbor
> resolution:
>
> # 192.0.2.12 dev eth0 FAILED
>
> If it's ipvlan that fails rather than macvlan there is a bunch of
> otherhost drops:
>
> # 17: ipvlan0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> # link/ether 00:0a:0b:0c:0d:01 brd ff:ff:ff:ff:ff:ff link-netns s-8BLcCn
> # RX: bytes packets errors dropped missed mcast
> # 702 10 0 0 0 3
> # RX errors: length crc frame fifo overrun otherhost
> # 0 0 0 0 0 4
Hmm, this one is suspicious. I can reproduce the ping failure locally,
but I don't see the "otherhost" drops. I will look into the recent failures.
Thanks
Hangbin
>
> FWIW here's the contents of the branches if you want to look thru:
> https://netdev.bots.linux.dev/static/nipa/branch_deltas/net-next-2025-11-09--12-00.html
> but 9th was the weekend, and the failure just got more frequent,
> we've been trying to track this down for a while..
* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
From: Hangbin Liu @ 2025-11-18 6:03 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev
On Mon, Nov 17, 2025 at 08:24:35AM +0000, Hangbin Liu wrote:
> > If it's ipvlan that fails rather than macvlan there is a bunch of
> > otherhost drops:
> >
> > # 17: ipvlan0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> > # link/ether 00:0a:0b:0c:0d:01 brd ff:ff:ff:ff:ff:ff link-netns s-8BLcCn
> > # RX: bytes packets errors dropped missed mcast
> > # 702 10 0 0 0 3
> > # RX errors: length crc frame fifo overrun otherhost
> > # 0 0 0 0 0 4
>
> Hmm, this one is suspicious. I can reproduce the ping failure locally,
> but I don't see the "otherhost" drops. I will look into the recent failures.
This looks like a timing-sensitive issue. With this change:
diff --git a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
index c4711272fe45..947c85ec2cbb 100755
--- a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
+++ b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
@@ -30,6 +30,7 @@ check_connection()
 	local message=${3}
 	RET=0
 
+	sleep 1
 	ip netns exec ${ns} ping ${target} -c 4 -i 0.1 &>/dev/null
 	check_err $? "ping failed"
 	log_test "${bond_mode}/${xvlan_type}_${xvlan_mode}: ${message}"
I ran the test 100 times (vng with 4 CPUs) and could not reproduce it anymore.
That may be why the debug kernel works fine.
I need some time to figure out which configuration affects the issue.
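For anyone who wants to repeat the "run it 100 times" check, here is a minimal sketch of a repeat loop. The helper name count_failures is hypothetical and not part of the selftests; it just runs a command a given number of times and prints how many runs exited non-zero.

```shell
# Hypothetical helper: run a command N times, print the number of failures.
# Usage: count_failures N command [args...]
count_failures()
{
	local runs="$1" fails=0 i=0
	shift

	while [ "$i" -lt "$runs" ]; do
		# Discard the command's output; only the exit status matters here.
		"$@" >/dev/null 2>&1 || fails=$((fails + 1))
		i=$((i + 1))
	done
	echo "$fails"
}

# Assumed invocation from the bonding selftest directory, inside the test VM:
#   count_failures 100 ./bond_macvlan_ipvlan.sh
```

Zero failures in 100 runs does not prove the race is gone, it only shows the failure rate dropped well below what was seen before the change.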
Thanks
Hangbin
* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
From: Jakub Kicinski @ 2025-11-18 15:13 UTC (permalink / raw)
To: Hangbin Liu; +Cc: netdev
On Tue, 18 Nov 2025 06:03:17 +0000 Hangbin Liu wrote:
> > Hmm, this one is suspicious. I can reproduce the ping failure locally,
> > but I don't see the "otherhost" drops. I will look into the recent failures.
>
> This looks like a timing-sensitive issue. With this change:
>
> diff --git a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> index c4711272fe45..947c85ec2cbb 100755
> --- a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> +++ b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> @@ -30,6 +30,7 @@ check_connection()
> local message=${3}
> RET=0
>
> + sleep 1
> ip netns exec ${ns} ping ${target} -c 4 -i 0.1 &>/dev/null
> check_err $? "ping failed"
> log_test "${bond_mode}/${xvlan_type}_${xvlan_mode}: ${message}"
>
> I ran the test 100 times (vng with 4 CPUs) and could not reproduce it anymore.
> That may be why the debug kernel works fine.
I see. I queued up a local change to add a 0.25 sec wait. Let's wait
a couple of days and see how much sleep we need here, this function
is called 96 times if I'm counting right.
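To put numbers on the cost, the extra runtime is simply the call count times the per-call delay. A small sketch follows, taking the 96 calls counted above as given; the helper name added_runtime is hypothetical.

```shell
# Hypothetical helper: total extra seconds = number of calls * delay per call.
# Usage: added_runtime CALLS DELAY_SECONDS
added_runtime()
{
	awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", n * d }'
}

# Compare a few candidate delays for 96 check_connection() calls.
for d in 0.25 0.5 1; do
	echo "${d}s per call: $(added_runtime 96 "$d")s extra"
done
# 0.25s per call: 24.0s extra
# 0.5s per call: 48.0s extra
# 1s per call: 96.0s extra
```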
* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
From: Jakub Kicinski @ 2025-11-26 15:19 UTC (permalink / raw)
To: Hangbin Liu; +Cc: netdev
On Tue, 18 Nov 2025 07:13:02 -0800 Jakub Kicinski wrote:
> On Tue, 18 Nov 2025 06:03:17 +0000 Hangbin Liu wrote:
> > > Hmm, this one is suspicious. I can reproduce the ping failure locally,
> > > but I don't see the "otherhost" drops. I will look into the recent failures.
> >
> > This looks like a timing-sensitive issue. With this change:
> >
> > diff --git a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > index c4711272fe45..947c85ec2cbb 100755
> > --- a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > +++ b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > @@ -30,6 +30,7 @@ check_connection()
> > local message=${3}
> > RET=0
> >
> > + sleep 1
> > ip netns exec ${ns} ping ${target} -c 4 -i 0.1 &>/dev/null
> > check_err $? "ping failed"
> > log_test "${bond_mode}/${xvlan_type}_${xvlan_mode}: ${message}"
> >
> > I ran the test 100 times (vng with 4 CPUs) and could not reproduce it anymore.
> > That may be why the debug kernel works fine.
>
> I see. I queued up a local change to add a 0.25 sec wait. Let's wait
> a couple of days and see how much sleep we need here, this function
> is called 96 times if I'm counting right.
Hi Hangbin!
The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
Would you mind submitting it officially?
* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
From: Hangbin Liu @ 2025-11-27 1:14 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev
On Wed, Nov 26, 2025 at 07:19:30AM -0800, Jakub Kicinski wrote:
> On Tue, 18 Nov 2025 07:13:02 -0800 Jakub Kicinski wrote:
> > On Tue, 18 Nov 2025 06:03:17 +0000 Hangbin Liu wrote:
> > > > Hmm, this one is suspicious. I can reproduce the ping failure locally,
> > > > but I don't see the "otherhost" drops. I will look into the recent failures.
> > >
> > > This looks like a timing-sensitive issue. With this change:
> > >
> > > diff --git a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > > index c4711272fe45..947c85ec2cbb 100755
> > > --- a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > > +++ b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > > @@ -30,6 +30,7 @@ check_connection()
> > > local message=${3}
> > > RET=0
> > >
> > > + sleep 1
> > > ip netns exec ${ns} ping ${target} -c 4 -i 0.1 &>/dev/null
> > > check_err $? "ping failed"
> > > log_test "${bond_mode}/${xvlan_type}_${xvlan_mode}: ${message}"
> > >
> > > I ran the test 100 times (vng with 4 CPUs) and could not reproduce it anymore.
> > > That may be why the debug kernel works fine.
> >
> > I see. I queued up a local change to add a 0.25 sec wait. Let's wait
> > a couple of days and see how much sleep we need here, this function
> > is called 96 times if I'm counting right.
>
> Hi Hangbin!
>
> The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
> Would you mind submitting it officially?
Good to hear this. I will submit it.
Thanks
Hangbin
* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
From: Hangbin Liu @ 2025-11-27 1:18 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev
On Thu, Nov 27, 2025 at 01:15:00AM +0000, Hangbin Liu wrote:
> > > I see. I queued up a local change to add a 0.25 sec wait. Let's wait
> > > a couple of days and see how much sleep we need here, this function
> > > is called 96 times if I'm counting right.
> >
> > Hi Hangbin!
> >
> > The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
> > Would you mind submitting it officially?
>
> Good to hear this. I will submit it.
Oh, I pressed the send button too fast. I forgot to ask—should we keep it at
0.25s or extend it to 0.5s to avoid flaky tests later?
Thanks
Hangbin
* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
From: Jakub Kicinski @ 2025-11-27 1:41 UTC (permalink / raw)
To: Hangbin Liu; +Cc: netdev
On Thu, 27 Nov 2025 01:18:30 +0000 Hangbin Liu wrote:
> On Thu, Nov 27, 2025 at 01:15:00AM +0000, Hangbin Liu wrote:
> > > Hi Hangbin!
> > >
> > > The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
> > > Would you mind submitting it officially?
> >
> > Good to hear this. I will submit it.
>
> Oh, I pressed the send button too fast. I forgot to ask—should we keep it at
> 0.25s or extend it to 0.5s to avoid flaky tests later?
I'd stick to 0.25sec since it was solid for a week.
I don't think the race window is very large, we could even experiment
with a smaller delay, because debug kernels don't hit the issue. The
debug kernel can't be >0.1sec slower I reckon.
IOW I hope 0.25sec already has pretty solid safety margin?
As I mentioned last week - this is called almost 100 times by the test
so the longer delays will be quite visible in the test runtime.
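As a rough sketch of what the agreed 0.25 sec variant might look like: the body below follows the diff quoted earlier in the thread, but the SETTLE_DELAY variable, the ns/target argument handling, and the stubbed check_err/log_test helpers are assumptions added so the snippet stands alone. The real test takes those helpers from the selftest libraries, and the patch that is actually submitted may differ.

```shell
# Sketch only. SETTLE_DELAY is a hypothetical knob; 0.25 is the value that
# ran for a week without flakes according to this thread.
SETTLE_DELAY=${SETTLE_DELAY:-0.25}

# Stand-ins for the selftest library helpers (assumed behaviour, simplified).
check_err()
{
	if [ "$1" -ne 0 ]; then
		RET=1
		echo "ERR: $2"
	fi
}

log_test()
{
	if [ "$RET" -eq 0 ]; then
		echo "TEST: $1 [ OK ]"
	else
		echo "TEST: $1 [FAIL]"
	fi
}

check_connection()
{
	local ns="${1}"
	local target="${2}"
	local message="${3}"
	RET=0

	# Short fixed delay before the first ping. The thread found this avoids
	# the flake; the underlying race has not been identified.
	sleep "${SETTLE_DELAY}"
	ip netns exec "${ns}" ping "${target}" -c 4 -i 0.1 >/dev/null 2>&1
	check_err $? "ping failed"
	log_test "${bond_mode}/${xvlan_type}_${xvlan_mode}: ${message}"
}
```

Keeping the delay in one variable would make it cheap to try the smaller values suggested above without touching the function body.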
* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
From: Hangbin Liu @ 2025-11-27 2:04 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev
On Wed, Nov 26, 2025 at 05:41:44PM -0800, Jakub Kicinski wrote:
> On Thu, 27 Nov 2025 01:18:30 +0000 Hangbin Liu wrote:
> > On Thu, Nov 27, 2025 at 01:15:00AM +0000, Hangbin Liu wrote:
> > > > Hi Hangbin!
> > > >
> > > > The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
> > > > Would you mind submitting it officially?
> > >
> > > Good to hear this. I will submit it.
> >
> > Oh, I pressed the send button too fast. I forgot to ask—should we keep it at
> > 0.25s or extend it to 0.5s to avoid flaky tests later?
>
> I'd stick to 0.25sec since it was solid for a week.
> I don't think the race window is very large, we could even experiment
> with a smaller delay, because debug kernels don't hit the issue. The
> debug kernel can't be >0.1sec slower I reckon.
>
> IOW I hope 0.25sec already has pretty solid safety margin?
> As I mentioned last week - this is called almost 100 times by the test
> so the longer delays will be quite visible in the test runtime.
Makes sense, the test does loop too many times.
Thanks
Hangbin