public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [BUG] mlx5: VLAN-aware bridge drops all traffic in legacy eswitch mode without promiscuous
@ 2026-04-24 11:07 bryan
  2026-04-27 13:55 ` Dragos Tatulea
  0 siblings, 1 reply; 3+ messages in thread
From: bryan @ 2026-04-24 11:07 UTC (permalink / raw)
  To: netdev; +Cc: saeedm, tariqt

Good day,

I wanted to check whether there is an open bug report or known fix in
progress for an issue that has been affecting mlx5 users (specifically
ConnectX-4 Lx, but likely broader, from what I have seen others
reporting) since at least 2021:

When an mlx5 interface is added as a port to a VLAN-aware Linux bridge
(bridge-vlan-aware yes / vlan_filtering 1) in legacy eswitch mode, all
traffic stops passing through the bridge. Both tagged and untagged
traffic is affected. The same configuration works correctly with non-
mlx5 NICs (tested Intel, Chelsio cards).

The only known workarounds are:
1. Enable promiscuous mode on the interface (ip link set dev <iface>
promisc on), which bypasses hardware VLAN filtering but has security
and performance implications. (this is what I am doing on my systems at
the moment)
2. Switch the eswitch to switchdev mode, which was fixed for a kernel
panic in February 2023 (net/mlx5e: Fix crash unsetting rx-vlan-filter
in switchdev mode) but introduces other issues including MDB errors and
is not suitable for all configurations. 
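For concreteness, this is roughly what the two workarounds look like on
my systems (the PCI address below is illustrative; substitute your own
device):

```shell
# Workaround 1: force promiscuous mode on the bridge port.
# This bypasses the NIC's VLAN/MAC filtering on the uplink.
ip link set dev nic0 promisc on

# Workaround 2: move the eswitch to switchdev mode.
# Check the current mode first, then switch (no VFs may be in use).
devlink dev eswitch show pci/0000:01:00.0
devlink dev eswitch set pci/0000:01:00.0 mode switchdev
```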

Based on reports I have seen from others in forums, this appears to have
been introduced somewhere around kernel 6.1-6.5, possibly related to a
commit that changed promiscuous mode efficiency in mlx5_core. I was not
using this hardware at the time, and cannot confirm firsthand. The
NVIDIA out-of-tree MLNX_EN driver does not exhibit this behavior in
legacy eswitch mode, which strongly suggests this is a regression in
the upstream mlx5 driver rather than a firmware or hardware issue. I do
not have first-hand experience with the mlx5 driver ever working
correctly - the idea that it did historically work correctly is based
purely on the reports of others (and the existence of old setup guides
that do not mention needing to try either of these workarounds.)

If it helps at all, I have tried various firmware versions on
ConnectX-4 Lx cards, ranging from an old release from 2017 all the way
up to the latest 14_32_1912. There has been no difference in behaviour
with regard to this issue.

This is well documented in community forums but does not appear to have
been formally reported to netdev, as far as I have been able to find. My
apologies in advance if this has been reported and I wasn't able to
locate it. Here are a couple of forum examples where this is discussed
among other affected users:

- NVIDIA Developer Forum (opened 2021, unresolved):
  https://forums.developer.nvidia.com/t/vlan-aware-linux-bridging-is-not-functional-on-connectx4lx-card-unless-manually-put-in-promiscuous-mode/206083

- Proxmox Forum thread (2023, ongoing):
  https://forum.proxmox.com/threads/mellanox-connectx-4-lx-and-brigde-vlan-aware-on-proxmox-8-0-1.130902/

- Community writeup with analysis:
  https://www.apalrd.net/posts/2023/tip_mellanox/

Has anyone bisected this or is there a fix already in progress that I
did not find? This affects a fairly common hypervisor configuration
(VLAN-aware bridge for VM networking) and the workarounds are not
conducive to production use.



Thank you for your time,

Bryan Pliscott

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] mlx5: VLAN-aware bridge drops all traffic in legacy eswitch mode without promiscuous
  2026-04-24 11:07 [BUG] mlx5: VLAN-aware bridge drops all traffic in legacy eswitch mode without promiscuous bryan
@ 2026-04-27 13:55 ` Dragos Tatulea
  2026-04-27 21:10   ` bryan
  0 siblings, 1 reply; 3+ messages in thread
From: Dragos Tatulea @ 2026-04-27 13:55 UTC (permalink / raw)
  To: bryan, netdev; +Cc: saeedm, tariqt

Hi,

On 24.04.26 13:07, bryan wrote:
> Good day,
>
> I wanted to check whether there is an open bug report or known fix in
> progress for an issue that has been affecting mlx5 users (specifically
> ConnectX-4 Lx, but likely broader, from what I have seen others
> reporting) since at least 2021:
>
> When an mlx5 interface is added as a port to a VLAN-aware Linux bridge
> (bridge-vlan-aware yes / vlan_filtering 1) in legacy eswitch mode, all
> traffic stops passing through the bridge. Both tagged and untagged
> traffic is affected. The same configuration works correctly with non-
> mlx5 NICs (tested Intel, Chelsio cards).
>
Is this even with one vlan? I ran a flow on a CX4LX pair with one vlan
and vlan_filtering set and traffic seems to be flowing normally.
Something like:

# IFACE=eth2
# VID=100
# ip link add br0 type bridge vlan_filtering 1
# ip link set "$IFACE" master br0
# bridge vlan add vid "$VID" dev "$IFACE"
# bridge vlan add vid "$VID" dev br0 self
# ip link add link br0 name "br0.$VID" type vlan id "$VID"
# ip addr add 10.0.0.1/24 dev br0
# ip addr add "10.0.$VID.1/24" dev "br0.$VID"
# ip link set "$IFACE" up
# ip link set br0 up
# ip link set "br0.$VID" up

From the other side where I have a similar setup I can ping
br0.100.

Tested on a CX4LX with FW version 28.48.1000 and kernel 6.18.
eth2 is a PF in legacy switchdev mode.

> [...]
> This is well documented in community forums but does not appear to have
> been formally reported to netdev that I have been able to find. My
> apologies in advance if this has been reported and I wasn't able to
> locate it. Here are a couple of forum examples where this is discussed
> among other affected users:
>
> - NVIDIA Developer Forum (opened 2021, unresolved):
>
> https://forums.developer.nvidia.com/t/vlan-aware-linux-bridging-is-not-functional-on-connectx4lx-card-unless-manually-put-in-promiscuous-mode/206083
>
> - Proxmox Forum thread (2023, ongoing):
>
> https://forum.proxmox.com/threads/mellanox-connectx-4-lx-and-brigde-vlan-aware-on-proxmox-8-0-1.130902/
>
> - Community writeup with analysis:
>   https://www.apalrd.net/posts/2023/tip_mellanox/
>
This last link seems the only one that provides some extra data. From it
I can see that the amount of VLAN ids > what the FW supports. This could
result in loss of traffic for the vlan ids > 512. Do you also see in
your dmesg these kinds of errors:

mlx5_core 0000:19:00.1: mlx5e_vport_context_update_vlans:179:(pid 13470): netdev vlans list size (4080) > (512) max vport list size, some vlans will be dropped

This is not a bug, simply a limit being reached.

> Has anyone bisected this or is there a fix already in progress that I
> did not find? This affects a fairly common hypervisor configuration
> (VLAN-aware bridge for VM networking) and the workarounds are not
> conducive to production use.
>
Could you provide a short repro script for this? Not being able to
reproduce the issue makes it hard to check :).

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] mlx5: VLAN-aware bridge drops all traffic in legacy eswitch mode without promiscuous
  2026-04-27 13:55 ` Dragos Tatulea
@ 2026-04-27 21:10   ` bryan
  0 siblings, 0 replies; 3+ messages in thread
From: bryan @ 2026-04-27 21:10 UTC (permalink / raw)
  To: Dragos Tatulea, netdev; +Cc: saeedm, tariqt

Here is one example config (sanitized). The promisc-on line is what
allows traffic to pass - if I disable promisc, traffic stops. These
are single-port CX4 Lx cards. nic0 is the physical interface, no VFs
are configured, and SR-IOV has been disabled as part of testing and
troubleshooting. Currently on kernel 6.17:

auto lo
iface lo inet loopback

auto nic0
iface nic0 inet manual
        up ip link set nic0 promisc on
auto vmbr0
iface vmbr0 inet manual
        bridge-ports nic0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-10 555

auto vmbr0.555
iface vmbr0.555 inet static
        address 192.168.1.123/24
        gateway 192.168.1.1


iface nic3 inet manual

iface nic1 inet manual

iface nic2 inet manual


This was ported over to use the new nic# bindings; before this it was
the standard enps0np0 naming. No difference in behaviour.


> Is this even with one vlan? I ran a flow on a CX4LX pair with one vlan
> and vlan_filtering set and traffic seems to be flowing normally.

I have not checked with literally only one VLAN, as that is not the use
case here. I can absolutely test that if it would help! Would you like
me to remove every VLAN but 555 from the interface and leave the rest
of the config as-is?



> This last link seems the only one that provides some extra data. From
> it I can see that the amount of VLAN ids > what the FW supports. This
> could result in loss of traffic for the vlan ids > 512. Do you also see
> in your dmesg these kinds of errors:

> mlx5_core 0000:19:00.1: mlx5e_vport_context_update_vlans:179:(pid
> 13470): netdev vlans list size (4080) > (512) max vport list size, some
> vlans will be dropped

> This is not a bug, simply a limit being reached.

Considering I am only operating with 10 VLANs, as can be seen in my
config, I do not think that is my issue. I am aware that there is also
quite a bit of noise in these threads - they are just forum posts - but
that does not appear to me to be related to my issue, or to the issue
of others here. I get no such messages or warnings. Additionally,
reports about that limit were about some VLANs being dropped. In my
case (and as others have reported) all VLANs are being dropped.
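For completeness, this is how I checked (interface name from my
config):

```shell
# List the VLANs configured on the bridge port - only the ~10 VIDs
# from bridge-vids, well under the 512-entry vport limit.
bridge vlan show dev nic0

# Look for the vport-list warning you quoted. Produces no output on
# my systems.
dmesg | grep 'max vport list size'
```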

> eth2 is a PF in legacy switchdev mode.

It was my understanding that legacy mode and switchdev mode were two
independent modes, with legacy handled in software and switchdev using
the eSwitch on the NIC itself. Please excuse my ignorance if that is
not the case. Could you clarify whether you used switchdev mode or
legacy mode? I ask because switchdev mode DOES function as a workaround
and passes traffic (but in my case results in system instability after
a time).
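For reference, this is how I am checking the mode on my side (PCI
address is from my box; yours will differ):

```shell
# Query the eswitch mode of the PF: "legacy" or "switchdev".
devlink dev eswitch show pci/0000:01:00.0
```

On my systems this reports mode legacy, which is where the bridge
drops traffic unless promisc is on.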

Thank you,
Bryan



On Mon, 2026-04-27 at 15:55 +0200, Dragos Tatulea wrote:
> Hi,
> 
> On 24.04.26 13:07, bryan wrote:
> > Good day,
> > 
> > I wanted to check whether there is an open bug report or known fix
> > in
> > progress for an issue that has been affecting mlx5 users
> > (specifically
> > ConnectX-4 Lx, but likely broader, from what I have seen others
> > reporting) since at least 2021:
> > 
> > When an mlx5 interface is added as a port to a VLAN-aware Linux
> > bridge
> > (bridge-vlan-aware yes / vlan_filtering 1) in legacy eswitch mode,
> > all
> > traffic stops passing through the bridge. Both tagged and untagged
> > traffic is affected. The same configuration works correctly with
> > non-
> > mlx5 NICs (tested Intel, Chelsio cards).
> > 
> Is this even with one vlan? I ran a flow on a CX4LX pair with one
> vlan
> and vlan_filtering set and traffic seems to be flowing normally.
> Something like:
> 
> # IFACE=eth2
> # VID=100
> # ip link add br0 type bridge vlan_filtering 1
> # ip link set "$IFACE" master br0
> # bridge vlan add vid "$VID" dev "$IFACE"
> # bridge vlan add vid "$VID" dev br0 self
> # ip link add link br0 name "br0.$VID" type vlan id "$VID"
> # ip addr add 10.0.0.1/24 dev br0
> # ip addr add "10.0.$VID.1/24" dev "br0.$VID"
> # ip link set "$IFACE" up
> # ip link set br0 up
> # ip link set "br0.$VID" up
> 
> From the other side where I have a similar setup I can ping
> br0.100.
> 
> Tested on a CX4LX with FW version 28.48.1000 and kernel 6.18.
> eth2 is a PF in legacy switchdev mode.
> 
> > [...]
> > This is well documented in community forums but does not appear to
> > have
> > been formally reported to netdev that I have been able to find. My
> > apologies in advance if this has been reported and I wasn't able to
> > locate it. Here are a couple of forum examples where this is
> > discussed
> > among other affected users:
> > 
> > - NVIDIA Developer Forum (opened 2021, unresolved):
> > 
> > https://forums.developer.nvidia.com/t/vlan-aware-linux-bridging-is-not-functional-on-connectx4lx-card-unless-manually-put-in-promiscuous-mode/206083
> > 
> > - Proxmox Forum thread (2023, ongoing):
> > 
> > https://forum.proxmox.com/threads/mellanox-connectx-4-lx-and-brigde-vlan-aware-on-proxmox-8-0-1.130902/
> > 
> > - Community writeup with analysis:
> >   https://www.apalrd.net/posts/2023/tip_mellanox/
> > 
> This last link seems the only one that provides some extra data. From
> it
> I can see that the amount of VLAN ids > what the FW supports. This
> could
> result in loss of traffic for the vlan ids > 512. Do you also see in
> your dmesg these kinds of errors:
> 
> mlx5_core 0000:19:00.1: mlx5e_vport_context_update_vlans:179:(pid
> 13470): netdev vlans list size (4080) > (512) max vport list size,
> some vlans will be dropped
> 
> This is not a bug, simply a limit being reached.
> 
> > Has anyone bisected this or is there a fix already in progress that
> > I
> > did not find? This affects a fairly common hypervisor configuration
> > (VLAN-aware bridge for VM networking) and the workarounds are not
> > conducive to production use.
> > 
> Could you provide a short repro script for this? Not being able to
> reproduce the issue makes it hard to check :).
> 
> Thanks,
> Dragos

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-04-27 21:10 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-24 11:07 [BUG] mlx5: VLAN-aware bridge drops all traffic in legacy eswitch mode without promiscuous bryan
2026-04-27 13:55 ` Dragos Tatulea
2026-04-27 21:10   ` bryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox