netdev.vger.kernel.org archive mirror
* [TEST] bond_macvlan_ipvlan.sh flakiness
@ 2025-11-14 16:20 Jakub Kicinski
  2025-11-17  8:24 ` Hangbin Liu
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2025-11-14 16:20 UTC
  To: Hangbin Liu; +Cc: netdev

Hi Hangbin!

The flakiness of bond_macvlan_ipvlan.sh has increased quite a lot
recently. Not sure if there's any correlation with kernel changes,
I didn't spot anything in bonding itself. Here's the history of runs:

https://netdev.bots.linux.dev/contest.html?executor=vmksft-bonding&test=bond-macvlan-ipvlan-sh

It looks like it's gotten much worse starting around the 9th?

Only the non-debug kernel build is flaking, debug builds are completely
clear:

https://netdev.bots.linux.dev/flakes.html?min-flip=0&tn-needle=bond-macvlan-ipvlan-sh

A few things that stood out to me, all the failures are like this:

# TEST: balance-$lb/$$$vlan_bridge: IPv4: client->$$$vlan_2   [FAIL]

Always IPv4 ping to the second interface, always fails neighbor
resolution:

# 192.0.2.12 dev eth0 FAILED 

If it's ipvlan that fails rather than macvlan there is a bunch of
otherhost drops:

# 17: ipvlan0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
#     link/ether 00:0a:0b:0c:0d:01 brd ff:ff:ff:ff:ff:ff link-netns s-8BLcCn
#     RX:  bytes packets errors dropped  missed   mcast           
#            702      10      0       0       0       3 
#     RX errors:  length    crc   frame    fifo overrun otherhost
#                      0      0       0       0       0         4

FWIW here's the contents of the branches if you want to look thru:
https://netdev.bots.linux.dev/static/nipa/branch_deltas/net-next-2025-11-09--12-00.html
but 9th was the weekend, and the failure just got more frequent,
we've been trying to track this down for a while..

* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
  2025-11-14 16:20 [TEST] bond_macvlan_ipvlan.sh flakiness Jakub Kicinski
@ 2025-11-17  8:24 ` Hangbin Liu
  2025-11-18  6:03   ` Hangbin Liu
  0 siblings, 1 reply; 9+ messages in thread
From: Hangbin Liu @ 2025-11-17  8:24 UTC
  To: Jakub Kicinski; +Cc: netdev

On Fri, Nov 14, 2025 at 08:20:14AM -0800, Jakub Kicinski wrote:
> Hi Hangbin!
> 
> The flakiness of bond_macvlan_ipvlan.sh has increased quite a lot
> recently. Not sure if there's any correlation with kernel changes,
> I didn't spot anything in bonding itself. Here's the history of runs:
> 
> https://netdev.bots.linux.dev/contest.html?executor=vmksft-bonding&test=bond-macvlan-ipvlan-sh
> 
> It looks like it's gotten much worse starting around the 9th?
> 
> Only the non-debug kernel build is flaking, debug builds are completely
> clear:
> 
> https://netdev.bots.linux.dev/flakes.html?min-flip=0&tn-needle=bond-macvlan-ipvlan-sh
> 
> A few things that stood out to me, all the failures are like this:
> 
> # TEST: balance-$lb/$$$vlan_bridge: IPv4: client->$$$vlan_2   [FAIL]
> 
> Always IPv4 ping to the second interface, always fails neighbor
> resolution:
> 
> # 192.0.2.12 dev eth0 FAILED 
> 
> If it's ipvlan that fails rather than macvlan there is a bunch of
> otherhost drops:
> 
> # 17: ipvlan0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> #     link/ether 00:0a:0b:0c:0d:01 brd ff:ff:ff:ff:ff:ff link-netns s-8BLcCn
> #     RX:  bytes packets errors dropped  missed   mcast           
> #            702      10      0       0       0       3 
> #     RX errors:  length    crc   frame    fifo overrun otherhost
> #                      0      0       0       0       0         4

Hmm, this one is suspicious. I can reproduce the ping failure locally,
but not the "otherhost" issue. I will look into the recent failures.

Thanks
Hangbin
> 
> FWIW here's the contents of the branches if you want to look thru:
> https://netdev.bots.linux.dev/static/nipa/branch_deltas/net-next-2025-11-09--12-00.html
> but 9th was the weekend, and the failure just got more frequent,
> we've been trying to track this down for a while..

* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
  2025-11-17  8:24 ` Hangbin Liu
@ 2025-11-18  6:03   ` Hangbin Liu
  2025-11-18 15:13     ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Hangbin Liu @ 2025-11-18  6:03 UTC
  To: Jakub Kicinski; +Cc: netdev

On Mon, Nov 17, 2025 at 08:24:35AM +0000, Hangbin Liu wrote:
> > If it's ipvlan that fails rather than macvlan there is a bunch of
> > otherhost drops:
> > 
> > # 17: ipvlan0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> > #     link/ether 00:0a:0b:0c:0d:01 brd ff:ff:ff:ff:ff:ff link-netns s-8BLcCn
> > #     RX:  bytes packets errors dropped  missed   mcast           
> > #            702      10      0       0       0       3 
> > #     RX errors:  length    crc   frame    fifo overrun otherhost
> > #                      0      0       0       0       0         4
> 
> Hmm, this one is suspicious. I can reproduce the ping failure locally,
> but not the "otherhost" issue. I will look into the recent failures.

This looks like a time-sensitive issue, with

diff --git a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
index c4711272fe45..947c85ec2cbb 100755
--- a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
+++ b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
@@ -30,6 +30,7 @@ check_connection()
        local message=${3}
        RET=0

+       sleep 1
        ip netns exec ${ns} ping ${target} -c 4 -i 0.1 &>/dev/null
        check_err $? "ping failed"
        log_test "${bond_mode}/${xvlan_type}_${xvlan_mode}: ${message}"

I ran the test 100 times (vng with 4 CPUs) and was not able to reproduce it anymore.
That may be why the debug kernel works fine.

I need some time to figure out which configuration affects the issue.

Thanks
Hangbin

* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
  2025-11-18  6:03   ` Hangbin Liu
@ 2025-11-18 15:13     ` Jakub Kicinski
  2025-11-26 15:19       ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2025-11-18 15:13 UTC
  To: Hangbin Liu; +Cc: netdev

On Tue, 18 Nov 2025 06:03:17 +0000 Hangbin Liu wrote:
> > Hmm, this one is suspicious. I can reproduce the ping failure locally,
> > but not the "otherhost" issue. I will look into the recent failures.
> 
> This looks like a time-sensitive issue, with
> 
> diff --git a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> index c4711272fe45..947c85ec2cbb 100755
> --- a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> +++ b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> @@ -30,6 +30,7 @@ check_connection()
>         local message=${3}
>         RET=0
> 
> +       sleep 1
>         ip netns exec ${ns} ping ${target} -c 4 -i 0.1 &>/dev/null
>         check_err $? "ping failed"
>         log_test "${bond_mode}/${xvlan_type}_${xvlan_mode}: ${message}"
> 
> I ran the test 100 times (vng with 4 CPUs) and was not able to reproduce it anymore.
> That may be why the debug kernel works fine.

I see. I queued up a local change to add a 0.25 sec wait. Let's wait 
a couple of days and see how much sleep we need here, this function 
is called 96 times if I'm counting right.
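
To put rough numbers on that trade-off, here is a small sketch of the
extra runtime a fixed pre-ping sleep adds. It assumes the count of 96
calls above is right; sleep_overhead is a hypothetical helper, not
something that exists in the selftest.

```shell
#!/bin/bash
# Hypothetical helper: total extra seconds added to one test run by
# sleeping <delay> seconds before each of <calls> connection checks.
# awk is used because bash arithmetic is integer-only.
sleep_overhead() {
	local calls=$1 delay=$2
	awk -v n="$calls" -v d="$delay" 'BEGIN { printf "%.1f\n", n * d }'
}

# 96 is the call count estimated in this thread (an assumption).
for delay in 0.1 0.25 0.5 1; do
	echo "sleep ${delay}s x 96 calls = $(sleep_overhead 96 "$delay")s extra"
done
```

Under that assumption, 0.25 sec costs about 24 seconds per run and the
original 1 sec sleep about 96 seconds.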

* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
  2025-11-18 15:13     ` Jakub Kicinski
@ 2025-11-26 15:19       ` Jakub Kicinski
  2025-11-27  1:14         ` Hangbin Liu
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2025-11-26 15:19 UTC
  To: Hangbin Liu; +Cc: netdev

On Tue, 18 Nov 2025 07:13:02 -0800 Jakub Kicinski wrote:
> On Tue, 18 Nov 2025 06:03:17 +0000 Hangbin Liu wrote:
> > > Hmm, this one is suspicious. I can reproduce the ping failure locally,
> > > but not the "otherhost" issue. I will look into the recent failures.
> > 
> > This looks like a time-sensitive issue, with
> > 
> > diff --git a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > index c4711272fe45..947c85ec2cbb 100755
> > --- a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > +++ b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > @@ -30,6 +30,7 @@ check_connection()
> >         local message=${3}
> >         RET=0
> > 
> > +       sleep 1
> >         ip netns exec ${ns} ping ${target} -c 4 -i 0.1 &>/dev/null
> >         check_err $? "ping failed"
> >         log_test "${bond_mode}/${xvlan_type}_${xvlan_mode}: ${message}"
> > 
> > I ran the test 100 times (vng with 4 CPUs) and was not able to reproduce it anymore.
> > That may be why the debug kernel works fine.
> 
> I see. I queued up a local change to add a 0.25 sec wait. Let's wait 
> a couple of days and see how much sleep we need here, this function 
> is called 96 times if I'm counting right.

Hi Hangbin!

The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
Would you mind submitting it officially?

* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
  2025-11-26 15:19       ` Jakub Kicinski
@ 2025-11-27  1:14         ` Hangbin Liu
  2025-11-27  1:18           ` Hangbin Liu
  0 siblings, 1 reply; 9+ messages in thread
From: Hangbin Liu @ 2025-11-27  1:14 UTC
  To: Jakub Kicinski; +Cc: netdev

On Wed, Nov 26, 2025 at 07:19:30AM -0800, Jakub Kicinski wrote:
> On Tue, 18 Nov 2025 07:13:02 -0800 Jakub Kicinski wrote:
> > On Tue, 18 Nov 2025 06:03:17 +0000 Hangbin Liu wrote:
> > > > Hmm, this one is suspicious. I can reproduce the ping failure locally,
> > > > but not the "otherhost" issue. I will look into the recent failures.
> > > 
> > > This looks like a time-sensitive issue, with
> > > 
> > > diff --git a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > > index c4711272fe45..947c85ec2cbb 100755
> > > --- a/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > > +++ b/tools/testing/selftests/drivers/net/bonding/bond_macvlan_ipvlan.sh
> > > @@ -30,6 +30,7 @@ check_connection()
> > >         local message=${3}
> > >         RET=0
> > > 
> > > +       sleep 1
> > >         ip netns exec ${ns} ping ${target} -c 4 -i 0.1 &>/dev/null
> > >         check_err $? "ping failed"
> > >         log_test "${bond_mode}/${xvlan_type}_${xvlan_mode}: ${message}"
> > > 
> > > I ran the test 100 times (vng with 4 CPUs) and was not able to reproduce it anymore.
> > > That may be why the debug kernel works fine.
> > 
> > I see. I queued up a local change to add a 0.25 sec wait. Let's wait 
> > a couple of days and see how much sleep we need here, this function 
> > is called 96 times if I'm counting right.
> 
> Hi Hangbin!
> 
> The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
> Would you mind submitting it officially?

Good to hear this. I will submit it.

Thanks
Hangbin

* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
  2025-11-27  1:14         ` Hangbin Liu
@ 2025-11-27  1:18           ` Hangbin Liu
  2025-11-27  1:41             ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Hangbin Liu @ 2025-11-27  1:18 UTC
  To: Jakub Kicinski; +Cc: netdev

On Thu, Nov 27, 2025 at 01:15:00AM +0000, Hangbin Liu wrote:
> > > I see. I queued up a local change to add a 0.25 sec wait. Let's wait 
> > > a couple of days and see how much sleep we need here, this function 
> > > is called 96 times if I'm counting right.
> > 
> > Hi Hangbin!
> > 
> > The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
> > Would you mind submitting it officially?
> 
> Good to hear this. I will submit it.

Oh, I pressed the send button too fast. I forgot to ask—should we keep it at
0.25s or extend it to 0.5s to avoid flaky tests later?

Thanks
Hangbin

* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
  2025-11-27  1:18           ` Hangbin Liu
@ 2025-11-27  1:41             ` Jakub Kicinski
  2025-11-27  2:04               ` Hangbin Liu
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2025-11-27  1:41 UTC
  To: Hangbin Liu; +Cc: netdev

On Thu, 27 Nov 2025 01:18:30 +0000 Hangbin Liu wrote:
> On Thu, Nov 27, 2025 at 01:15:00AM +0000, Hangbin Liu wrote:
> > > Hi Hangbin!
> > > 
> > > The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
> > > Would you mind submitting it officially?  
> > 
> > Good to hear this. I will submit it.  
> 
> Oh, I pressed the send button too fast. I forgot to ask—should we keep it at
> 0.25s or extend it to 0.5s to avoid flaky tests later?

I'd stick to 0.25sec since it was solid for a week.
I don't think the race window is very large, we could even experiment
with a smaller delay, because debug kernels don't hit the issue. The
debug kernel can't be >0.1sec slower I reckon.

IOW I hope 0.25sec already has pretty solid safety margin?
As I mentioned last week - this is called almost 100 times by the test
so the longer delays will be quite visible in the test runtime.
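
For experimenting with how small the delay can get, one option is to
poll rather than sleep a fixed amount. The sketch below is not what
this thread settled on (that is the fixed 0.25 sec sleep); retry_until
and its millisecond arguments are hypothetical names for illustration,
and it relies on a sleep that accepts fractional seconds.

```shell
#!/bin/bash
# Hypothetical helper, not part of the selftest: run a command until it
# succeeds, sleeping <interval_ms> between attempts, and give up once
# roughly <timeout_ms> of sleeping has accumulated.
retry_until() {
	local timeout_ms=$1 interval_ms=$2
	shift 2
	local waited=0

	while ! "$@"; do
		if [ "$waited" -ge "$timeout_ms" ]; then
			return 1
		fi
		# Convert milliseconds to fractional seconds for sleep.
		sleep "$(awk -v ms="$interval_ms" 'BEGIN { printf "%.3f", ms / 1000 }')"
		waited=$((waited + interval_ms))
	done
	return 0
}

# Possible use inside check_connection() (illustrative only):
#   retry_until 250 50 ip netns exec "${ns}" ping "${target}" -c 1 -W 1
```

A passing check returns right away, so the worst-case wait is only paid
when the first attempt fails.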

* Re: [TEST] bond_macvlan_ipvlan.sh flakiness
  2025-11-27  1:41             ` Jakub Kicinski
@ 2025-11-27  2:04               ` Hangbin Liu
  0 siblings, 0 replies; 9+ messages in thread
From: Hangbin Liu @ 2025-11-27  2:04 UTC
  To: Jakub Kicinski; +Cc: netdev

On Wed, Nov 26, 2025 at 05:41:44PM -0800, Jakub Kicinski wrote:
> On Thu, 27 Nov 2025 01:18:30 +0000 Hangbin Liu wrote:
> > On Thu, Nov 27, 2025 at 01:15:00AM +0000, Hangbin Liu wrote:
> > > > Hi Hangbin!
> > > > 
> > > > The 0.25 sec sleep was added locally 1 week ago and 0 flakes since.
> > > > Would you mind submitting it officially?  
> > > 
> > > Good to hear this. I will submit it.  
> > 
> > Oh, I pressed the send button too fast. I forgot to ask—should we keep it at
> > 0.25s or extend it to 0.5s to avoid flaky tests later?
> 
> I'd stick to 0.25sec since it was solid for a week.
> I don't think the race window is very large, we could even experiment
> with a smaller delay, because debug kernels don't hit the issue. The
> debug kernel can't be >0.1sec slower I reckon.
> 
> IOW I hope 0.25sec already has pretty solid safety margin?
> As I mentioned last week - this is called almost 100 times by the test
> so the longer delays will be quite visible in the test runtime.

Makes sense, the test does loop too many times.

Thanks
Hangbin

Thread overview: 9+ messages
2025-11-14 16:20 [TEST] bond_macvlan_ipvlan.sh flakiness Jakub Kicinski
2025-11-17  8:24 ` Hangbin Liu
2025-11-18  6:03   ` Hangbin Liu
2025-11-18 15:13     ` Jakub Kicinski
2025-11-26 15:19       ` Jakub Kicinski
2025-11-27  1:14         ` Hangbin Liu
2025-11-27  1:18           ` Hangbin Liu
2025-11-27  1:41             ` Jakub Kicinski
2025-11-27  2:04               ` Hangbin Liu