public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH net-next v2 07/14] net: bridge: mcast: track active state, own MLD querier disappearance
From: Ido Schimmel @ 2026-02-08 16:09 UTC (permalink / raw)
  To: Linus Lüssing
  Cc: bridge, netdev, linux-kernel, Nikolay Aleksandrov, Andrew Lunn,
	Simon Horman, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S . Miller, Kuniyuki Iwashima, Stanislav Fomichev,
	Xiao Liang
In-Reply-To: <20260206030123.5430-8-linus.luessing@c0d3.blue>

In subject: s/own MLD querier disappearance/own querier disappearance/ ?

On Fri, Feb 06, 2026 at 03:52:13AM +0100, Linus Lüssing wrote:
> This change ensures that the new multicast active state variable is
> immediately unset if our internal IGMP/MLD querier was elected and
> now disabled.
> 
> If no IGMP/MLD querier exists on the link then we can't reliably receive
> IGMP/MLD reports and in turn can't ensure the completeness of our MDB
> anymore either.
> 
> No functional change for the fast/data path yet. This is the last
> necessary check before using the new multicast active state variable
> in the fast/data path, too.

The last sentence needs to be dropped?

> 
> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>

Code looks fine:

Reviewed-by: Ido Schimmel <idosch@nvidia.com>

> ---
>  net/bridge/br_multicast.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
> index a1cde2ba2a3e..2710c21daef5 100644
> --- a/net/bridge/br_multicast.c
> +++ b/net/bridge/br_multicast.c
> @@ -4914,6 +4914,7 @@ int br_multicast_set_querier(struct net_bridge_mcast *brmctx, unsigned long val)
>  #endif
>  
>  unlock:
> +	br_multicast_update_active(brmctx);
>  	spin_unlock_bh(&brmctx->br->multicast_lock);
>  
>  	return 0;
> -- 
> 2.51.0
> 

^ permalink raw reply

* Re: [PATCH net-next v2 06/14] net: bridge: mcast: track active state, IPv6 address availability
From: Ido Schimmel @ 2026-02-08 16:08 UTC (permalink / raw)
  To: Linus Lüssing
  Cc: bridge, netdev, linux-kernel, Nikolay Aleksandrov, Andrew Lunn,
	Simon Horman, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S . Miller, Kuniyuki Iwashima, Stanislav Fomichev,
	Xiao Liang
In-Reply-To: <20260206030123.5430-7-linus.luessing@c0d3.blue>

On Fri, Feb 06, 2026 at 03:52:12AM +0100, Linus Lüssing wrote:
> If we are the only potential MLD querier but don't have an IPv6
> link-local address configured on our bridge interface then we can't
> create a valid MLD query and in turn can't reliably receive MLD reports
> and can't build a complete MDB. Hence disable the new multicast active
> state variable then. Or reenable it if an IPv6 link-local address
> became available.
> 
> No functional change for the fast/data path yet.
> 
> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>

> ---
>  net/bridge/br_multicast.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
> index 0fc29875db9c..a1cde2ba2a3e 100644
> --- a/net/bridge/br_multicast.c
> +++ b/net/bridge/br_multicast.c
> @@ -1125,6 +1125,7 @@ static void br_multicast_notify_active(struct net_bridge_mcast *brmctx,
>   * The multicast active state is set, per protocol family, if:
>   *
>   * - an IGMP/MLD querier is present
> + * - for own IPv6 MLD querier: an IPv6 address is configured on the bridge

an IPv6 link-local address

>   *
>   * And is unset otherwise.
>   *
> @@ -1222,10 +1223,12 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge_mcast *brm
>  			       &ip6h->daddr, 0, &ip6h->saddr)) {
>  		kfree_skb(skb);
>  		br_opt_toggle(brmctx->br, BROPT_HAS_IPV6_ADDR, false);
> +		br_multicast_update_active(brmctx);
>  		return NULL;
>  	}
>  
>  	br_opt_toggle(brmctx->br, BROPT_HAS_IPV6_ADDR, true);
> +	br_multicast_update_active(brmctx);
>  	ipv6_eth_mc_map(&ip6h->daddr, eth->h_dest);
>  
>  	hopopt = (u8 *)(ip6h + 1);
> -- 
> 2.51.0
> 

^ permalink raw reply

* Re: [PATCH net-next v2 05/14] net: bridge: mcast: track active state, foreign IGMP/MLD querier disappearance
From: Ido Schimmel @ 2026-02-08 16:08 UTC (permalink / raw)
  To: Linus Lüssing
  Cc: bridge, netdev, linux-kernel, Nikolay Aleksandrov, Andrew Lunn,
	Simon Horman, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S . Miller, Kuniyuki Iwashima, Stanislav Fomichev,
	Xiao Liang
In-Reply-To: <20260206030123.5430-6-linus.luessing@c0d3.blue>

On Fri, Feb 06, 2026 at 03:52:11AM +0100, Linus Lüssing wrote:
> This change ensures that the new multicast active state variable is unset
> again after a foreign IGMP/MLD querier has disappeared (default: 255
> seconds). If no new, other IGMP/MLD querier took over then we can't
> reliably receive IGMP/MLD reports anymore and in turn can't ensure the
> completeness of our MDB anymore either.
> 
> No functional change for the fast/data path yet.
> 
> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>

^ permalink raw reply

* Re: [PATCH net-next v2 04/14] net: bridge: mcast: track active state, IGMP/MLD querier appearance
From: Ido Schimmel @ 2026-02-08 16:07 UTC (permalink / raw)
  To: Linus Lüssing
  Cc: bridge, netdev, linux-kernel, Nikolay Aleksandrov, Andrew Lunn,
	Simon Horman, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S . Miller, Kuniyuki Iwashima, Stanislav Fomichev,
	Xiao Liang
In-Reply-To: <20260206030123.5430-5-linus.luessing@c0d3.blue>

On Fri, Feb 06, 2026 at 03:52:10AM +0100, Linus Lüssing wrote:
> +static void br_multicast_notify_active(struct net_bridge_mcast *brmctx,
> +				       bool ip4_active_old, bool ip6_active_old)
> +{
> +	if (brmctx->ip4_active == ip4_active_old &&
> +	    brmctx->ip6_active == ip6_active_old)
> +		return;
> +
> +	br_info(brmctx->br, "mc_active changed, vid: %i: v4: %i->%i, v6: %i->%i\n",
> +		brmctx->vlan ? brmctx->vlan->vid : -1,
> +		ip4_active_old, brmctx->ip4_active,
> +		ip6_active_old, brmctx->ip6_active);

Make this br_debug() to avoid spamming the kernel log?

I am aware that this can also be notified over netlink, but it will add
extra complexity and I am not sure anyone will use, so it might be best
to defer it for now.

> +}

^ permalink raw reply

* Re: [PATCH net-next v2 03/14] net: bridge: mcast: avoid sleeping on bridge-down
From: Ido Schimmel @ 2026-02-08 16:01 UTC (permalink / raw)
  To: Linus Lüssing
  Cc: bridge, netdev, linux-kernel, Nikolay Aleksandrov, Andrew Lunn,
	Simon Horman, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S . Miller, Kuniyuki Iwashima, Stanislav Fomichev,
	Xiao Liang
In-Reply-To: <20260206030123.5430-4-linus.luessing@c0d3.blue>

On Fri, Feb 06, 2026 at 03:52:09AM +0100, Linus Lüssing wrote:
> We later want to use the multicast lock when setting the bridge
> interface up or down, to be able to atomically both check all conditions
> to toggle the multicast active state and to subsequently toggle it.
> While most variables we check / contexts we check from are serialized
> (toggled variables through netlink/sysfs) the timer_pending() check is
> not and might run in parallel.
> 
> However so far we are not allowed to spinlock __br_multicast_stop() as
> its call to timer_delete_sync() might sleep. Therefore replacing the
> sleeping variant with the non-sleeping one. It is sufficient to only
> wait for any timer callback to finish when we are freeing the multicast
> context.
> 
> Using the timer_shutdown() instead of the timer_delete() variant also
> allows us to detect that we are stopping from within the according timer
> callbacks, to retain the promise of the previous timer_delete_sync()
> calls that no multicast state is changed after these
> timer_{delete,shutdown}*() calls. And more importantly that we are not
> inadvertently rearming timers.

Can you clarify what you mean by "allows us to detect that we are
stopping from within the according timer callbacks"?

> 
> This new check also makes the netif_running() check redundant/obsolete
> in these contexts.
> 
> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
> ---
>  net/bridge/br_device.c    |   4 ++
>  net/bridge/br_multicast.c | 108 ++++++++++++++++++++++++++------------
>  net/bridge/br_private.h   |   5 ++
>  net/bridge/br_vlan.c      |   5 ++
>  4 files changed, 87 insertions(+), 35 deletions(-)
> 
> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index a818fdc22da9..d9d1227d5708 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -168,7 +168,9 @@ static int br_dev_open(struct net_device *dev)
>  	netdev_update_features(dev);
>  	netif_start_queue(dev);
>  	br_stp_enable_bridge(br);
> +	spin_lock_bh(&br->multicast_lock);
>  	br_multicast_open(br);

Maybe move the spin_lock_bh() / spin_unlock_bh() to br_multicast_open()
and have it call br_multicast_open_locked() that will also be invoked
from br_multicast_toggle()?

> +	spin_unlock_bh(&br->multicast_lock);
>  
>  	if (br_opt_get(br, BROPT_MULTICAST_ENABLED))
>  		br_multicast_join_snoopers(br);
> @@ -191,7 +193,9 @@ static int br_dev_stop(struct net_device *dev)
>  	struct net_bridge *br = netdev_priv(dev);
>  
>  	br_stp_disable_bridge(br);
> +	spin_lock_bh(&br->multicast_lock);
>  	br_multicast_stop(br);

And like br_multicast_open(), move the locking into br_multicast_stop()?

> +	spin_unlock_bh(&br->multicast_lock);
>  
>  	if (br_opt_get(br, BROPT_MULTICAST_ENABLED))
>  		br_multicast_leave_snoopers(br);
> diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
> index dccae08b4f4c..f5a368dd20a3 100644
> --- a/net/bridge/br_multicast.c
> +++ b/net/bridge/br_multicast.c
> @@ -1665,6 +1665,14 @@ static void br_multicast_router_expired(struct net_bridge_mcast_port *pmctx,
>  	spin_unlock(&br->multicast_lock);
>  }
>  
> +static bool br_multicast_stopping(struct net_bridge *br,

Nit: br_multicast_is_stopping() ?

> +				  struct timer_list *timer)
> +{
> +	lockdep_assert_held_once(&br->multicast_lock);
> +
> +	return !timer->function;
> +}

^ permalink raw reply

* Re: [PATCH net-next v2 02/14] net: bridge: mcast: track active state, adding tests
From: Ido Schimmel @ 2026-02-08 16:00 UTC (permalink / raw)
  To: Linus Lüssing
  Cc: bridge, netdev, linux-kernel, Nikolay Aleksandrov, Andrew Lunn,
	Simon Horman, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S . Miller, Kuniyuki Iwashima, Stanislav Fomichev,
	Xiao Liang
In-Reply-To: <20260206030123.5430-3-linus.luessing@c0d3.blue>

On Fri, Feb 06, 2026 at 03:52:08AM +0100, Linus Lüssing wrote:
> Before making any significant changes to the internals of the Linux
> bridge add some tests regarding the multicast activity. This is
> also to verify that we have the semantics of the new
> *_MCAST_ACTIVE_{V4,V6} netlink attributes as expected.
> 
> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
> ---
>  .../net/forwarding/bridge_mdb_active.sh       | 682 ++++++++++++++++++
>  1 file changed, 682 insertions(+)
>  create mode 100755 tools/testing/selftests/net/forwarding/bridge_mdb_active.sh
> 
> diff --git a/tools/testing/selftests/net/forwarding/bridge_mdb_active.sh b/tools/testing/selftests/net/forwarding/bridge_mdb_active.sh
> new file mode 100755
> index 000000000000..5b6e14d88bc2
> --- /dev/null
> +++ b/tools/testing/selftests/net/forwarding/bridge_mdb_active.sh
> @@ -0,0 +1,682 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +
> +# +-------+      +---------+
> +# | brq0  |      | br0     |
> +# | + $h1 |      | + $swp1 |
> +# +----|--+      +----|----+
> +#      |              |
> +#      \--------------/
> +#
> +#
> +# This script checks if we have the expected mcast_active_v{4,6} state
> +# on br0 in a variety of scenarios. This state determines if ultimately
> +# multicast snooping is applied to multicast data packets
> +# (multicast snooping active) or if they are (by default) flooded instead
> +# (multicast snooping inactive).
> +#
> +# Notably, multicast snooping can be enabled but still be inactive if not all
> +# requirements to safely apply multicast snooping to multicast data packets
> +# are met.
> +#
> +# Depending on the test case an IGMP/MLD querier might be on brq0, on br0
> +# or neither.
> +
> +
> +ALL_TESTS="
> +	test_inactive
> +	test_active_other_querier
> +	test_active_own_querier
> +	test_inactive_brdown
> +	test_inactive_nov6
> +	test_inactive_snooping_off
> +	test_inactive_querier_off
> +	test_inactive_other_querier_norespdelay
> +	test_inactive_own_querier_norespdelay
> +	test_inactive
> +	test_vlan_inactive
> +	test_vlan_active_other_querier
> +	test_vlan_active_own_querier
> +	test_vlan_inactive_brdown
> +	test_vlan_inactive_nov6
> +	test_vlan_inactive_snooping_off
> +	test_vlan_inactive_vlans_snooping_off
> +	test_vlan_inactive_vlan_snooping_off
> +	test_vlan_inactive_other_querier_norespdelay
> +	test_vlan_inactive_own_querier_norespdelay
> +"
> +
> +NUM_NETIFS=1

Shouldn't this be "2" given below you have "swp1=${NETIFS[p2]}"?

> +MCAST_MAX_RESP_IVAL_SEC=1
> +MCAST_VLAN_ID=42
> +source lib.sh
> +
> +switch_create()
> +{
> +	ip link add dev br0 type bridge\
> +		vlan_filtering 0 \
> +		mcast_query_response_interval $((${MCAST_MAX_RESP_IVAL_SEC}*100))\
> +		mcast_snooping 0 \
> +		mcast_vlan_snooping 0
> +	ip link add dev brq0 type bridge\
> +		vlan_filtering 0 \
> +		mcast_query_response_interval $((${MCAST_MAX_RESP_IVAL_SEC}*100))\
> +		mcast_snooping 0 \
> +		mcast_vlan_snooping 0
> +
> +	echo 1 > /proc/sys/net/ipv6/conf/br0/disable_ipv6
> +	echo 1 > /proc/sys/net/ipv6/conf/brq0/disable_ipv6
> +
> +	ip link set dev $swp1 master br0
> +	ip link set dev $h1 master brq0
> +
> +	ip link set dev $h1 up
> +	ip link set dev $swp1 up
> +}
> +
> +switch_destroy()
> +{
> +	ip link set dev $swp1 down
> +	ip link set dev $h1 down
> +
> +	ip link del dev brq0
> +	ip link del dev br0
> +}
> +
> +setup_prepare()
> +{
> +	h1=${NETIFS[p1]}
> +	swp1=${NETIFS[p2]}
> +
> +	switch_create
> +}
> +
> +cleanup()
> +{
> +	pre_cleanup
> +	switch_destroy
> +}
> +
> +mcast_active_check()
> +{
> +	local af="$1"
> +	local state="$2"
> +
> +	ip -d -j link show dev br0\
> +		| jq -e ".[] | select(.linkinfo.info_data.mcast_active_$af == $state)"\
> +		&> /dev/null
> +
> +	check_err $? "Mcast active check failed"

Please include the address family in the error message to make it
clearer what failed. Same in other places.

> +}
> +
> +mcast_vlan_active_check()
> +{
> +	local af="$1"
> +	local state="$2"
> +	local vid="${MCAST_VLAN_ID}"
> +	local ret
> +
> +	bridge -j vlan global show dev br0\
> +		| jq -e ".[].vlans.[] | select(.vlan == $vid and .mcast_active_$af == 1)"\
> +		&> /dev/null
> +	ret="$?"
> +
> +	if [ $ret -eq 0 -a $state -eq 0 ] || [ $ret -ne 0 -a $state -eq 1 ]; then
> +		check_err 1 "Mcast VLAN active check failed"
> +	fi
> +}
> +
> +mcast_assert_active_v4()
> +{
> +	mcast_active_check "v4" "1"
> +}
> +
> +mcast_assert_active_v6()
> +{
> +	mcast_active_check "v6" "1"
> +}
> +
> +mcast_assert_inactive_v4()
> +{
> +	mcast_active_check "v4" "0"
> +}
> +
> +mcast_assert_inactive_v6()
> +{
> +	mcast_active_check "v6" "0"
> +}
> +
> +mcast_vlan_assert_active_v4()
> +{
> +	mcast_vlan_active_check "v4" "1"
> +}
> +
> +mcast_vlan_assert_active_v6()
> +{
> +	mcast_vlan_active_check "v6" "1"
> +}
> +
> +mcast_vlan_assert_inactive_v4()
> +{
> +	mcast_vlan_active_check "v4" "0"
> +}
> +
> +mcast_vlan_assert_inactive_v6()
> +{
> +	mcast_vlan_active_check "v6" "0"
> +}
> +
> +

Double blank line

> +test_inactive_nolog()
> +{
> +	ip link set dev br0 down
> +	ip link set dev brq0 down
> +	ip link set dev br0 type bridge mcast_snooping 0
> +	ip link set dev brq0 type bridge mcast_snooping 0
> +	ip link set dev br0 type bridge mcast_querier 0
> +	ip link set dev brq0 type bridge mcast_querier 0
> +	ip link set dev br0 type bridge mcast_vlan_snooping 0
> +	ip link set dev br0 type bridge vlan_filtering 0
> +
> +	echo 1 > /proc/sys/net/ipv6/conf/br0/disable_ipv6
> +	echo 1 > /proc/sys/net/ipv6/conf/brq0/disable_ipv6
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +}
> +
> +test_inactive()
> +{
> +	RET=0
> +
> +	test_inactive_nolog
> +	log_test "Mcast inactive test"
> +}
> +
> +wait_lladdr_dad() {
> +	local check_tentative
> +
> +	check_tentative="map(select(.scope == \"link\" and ((.tentative == true) | not))) | .[]"
> +
> +	ip -6 -j a s dev "$1"\
> +		| jq -e ".[].addr_info | ${check_tentative}" &> /dev/null
> +}
> +
> +test_active_setup_bridge()
> +{
> +	[ -n "$1" ] && echo 0 > /proc/sys/net/ipv6/conf/br0/disable_ipv6
> +	[ -n "$2" ] && echo 0 > /proc/sys/net/ipv6/conf/brq0/disable_ipv6
> +
> +	[ -n "$3" ] && ip link set dev br0 up
> +	[ -n "$4" ] && ip link set dev brq0 up
> +	[ -n "$5" ] && slowwait 3 wait_lladdr_dad br0
> +	[ -n "$6" ] && slowwait 3 wait_lladdr_dad brq0
> +}
> +
> +test_active_setup_config()
> +{
> +	[ -n "$1" ] && ip link set dev br0 type bridge mcast_snooping 1
> +	[ -n "$2" ] && ip link set dev brq0 type bridge mcast_snooping 1
> +	[ -n "$3" ] && ip link set dev br0 type bridge mcast_querier 1
> +	[ -n "$4" ] && ip link set dev brq0 type bridge mcast_querier 1
> +}
> +
> +test_active_setup_wait()
> +{
> +	sleep $((${MCAST_MAX_RESP_IVAL_SEC} * 2))
> +}
> +
> +test_active_setup_reset_own_querier()
> +{
> +	ip link set dev br0 type bridge mcast_querier 0
> +	ip link set dev br0 type bridge mcast_querier 1
> +
> +	test_active_setup_wait
> +}
> +
> +test_vlan_active_setup_config()
> +{
> +	[ -n "$1" ] && ip link set dev br0 type bridge vlan_filtering 1
> +	[ -n "$2" ] && ip link set dev brq0 type bridge vlan_filtering 1
> +	[ -n "$3" ] && ip link set dev br0 type bridge mcast_vlan_snooping 1
> +	[ -n "$4" ] && ip link set dev brq0 type bridge mcast_vlan_snooping 1
> +}
> +
> +test_vlan_active_setup_add_vlan()
> +{
> +	bridge vlan add vid ${MCAST_VLAN_ID} dev $swp1
> +	bridge vlan add vid ${MCAST_VLAN_ID} dev $h1
> +	bridge vlan global set vid ${MCAST_VLAN_ID} dev br0\
> +		mcast_query_response_interval $((${MCAST_MAX_RESP_IVAL_SEC}*100))
> +	bridge vlan global set vid ${MCAST_VLAN_ID} dev brq0\
> +		mcast_query_response_interval $((${MCAST_MAX_RESP_IVAL_SEC}*100))
> +	bridge vlan global set vid ${MCAST_VLAN_ID} dev br0 mcast_snooping 0
> +	bridge vlan global set vid ${MCAST_VLAN_ID} dev brq0 mcast_snooping 0
> +	bridge vlan global set vid ${MCAST_VLAN_ID} dev br0 mcast_querier 0
> +	bridge vlan global set vid ${MCAST_VLAN_ID} dev brq0 mcast_querier 0
> +}
> +
> +test_vlan_active_setup_config_vlan()
> +{
> +	[ -n "$1" ] && bridge vlan global set vid ${MCAST_VLAN_ID} dev br0 mcast_snooping 1
> +	[ -n "$2" ] && bridge vlan global set vid ${MCAST_VLAN_ID} dev brq0 mcast_snooping 1
> +	[ -n "$3" ] && bridge vlan global set vid ${MCAST_VLAN_ID} dev br0 mcast_querier 1
> +	[ -n "$4" ] && bridge vlan global set vid ${MCAST_VLAN_ID} dev brq0 mcast_querier 1
> +}
> +
> +test_vlan_teardown()
> +{
> +	bridge vlan del vid ${MCAST_VLAN_ID} dev $swp1
> +	bridge vlan del vid ${MCAST_VLAN_ID} dev $h1
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +	mcast_vlan_assert_inactive_v4
> +	mcast_vlan_assert_inactive_v6
> +}
> +
> +test_vlan_active_setup_reset_own_querier()
> +{
> +	bridge vlan global set vid ${MCAST_VLAN_ID} dev br0 mcast_querier 0
> +	bridge vlan global set vid ${MCAST_VLAN_ID} dev br0 mcast_querier 1
> +
> +	test_active_setup_wait
> +}
> +
> +test_active_other_querier_nolog()
> +{
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  "4"
> +	test_active_setup_wait
> +
> +	mcast_assert_active_v4
> +	mcast_assert_active_v6
> +}
> +
> +test_active_other_querier()
> +{
> +	RET=0
> +
> +	test_active_other_querier_nolog
> +	test_inactive_nolog
> +	log_test "Mcast active with other querier test"
> +}
> +
> +test_active_own_querier_nolog()
> +{
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" "3" ""
> +	test_active_setup_wait
> +
> +	mcast_assert_active_v4
> +	mcast_assert_active_v6
> +}
> +
> +test_active_own_querier()
> +{
> +	RET=0
> +
> +	test_active_own_querier_nolog
> +	test_inactive_nolog
> +	log_test "Mcast active with own querier test"
> +}
> +
> +test_active_final()
> +{
> +	mcast_assert_active_v4
> +	mcast_assert_active_v6
> +
> +	test_inactive_nolog
> +}
> +
> +test_inactive_brdown()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" ""  "4" ""  "6"
> +	test_active_setup_config "1" "2" "3" ""
> +	test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +
> +	test_active_setup_bridge ""  ""  "3" ""  ""  ""
> +	mcast_assert_active_v4
> +	mcast_assert_inactive_v6
> +
> +	test_active_setup_bridge ""  ""  ""  ""  "5" ""
> +	test_active_setup_reset_own_querier
> +	test_active_final
> +
> +	log_test "Mcast inactive, bridge down test"
> +}
> +
> +test_inactive_nov6()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge ""  "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" "3" ""
> +	test_active_setup_wait
> +
> +	mcast_assert_active_v4
> +	mcast_assert_inactive_v6
> +
> +	test_active_setup_bridge "1" ""  ""  ""  "5" ""
> +	test_active_setup_reset_own_querier
> +	test_active_final
> +
> +	log_test "Mcast inactive, own querier, no IPv6 address test"
> +}
> +
> +test_inactive_snooping_off()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config ""  "2" "3" ""
> +	test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +
> +	test_active_setup_config "1" ""  ""  ""
> +	test_active_setup_reset_own_querier
> +	test_active_final
> +
> +	log_test "Mcast inactive, snooping disabled test"
> +}
> +
> +test_inactive_querier_off()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  ""
> +	test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +
> +	test_active_setup_config ""  ""  "3" ""
> +	test_active_setup_wait
> +	test_active_final
> +
> +	log_test "Mcast inactive, no querier test"
> +}
> +
> +test_inactive_other_querier_norespdelay()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" "3" ""
> +	#test_active_setup_wait

Why the comment? There are more instances below

> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +
> +	test_active_setup_wait
> +	test_active_final
> +
> +	log_test "Mcast inactive, other querier, no response delay test"
> +}
> +
> +test_inactive_own_querier_norespdelay()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  "4"
> +	#test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +
> +	test_active_setup_wait
> +	test_active_final
> +
> +	log_test "Mcast inactive, own querier, no response delay test"
> +}
> +
> +test_vlan_inactive()
> +{
> +	RET=0
> +
> +	test_inactive_nolog
> +	mcast_vlan_assert_inactive_v4
> +	mcast_vlan_assert_inactive_v6
> +
> +	ip link set dev br0 type bridge vlan_filtering 1
> +	ip link set dev br0 type bridge mcast_vlan_snooping 1
> +	mcast_vlan_assert_inactive_v4
> +	mcast_vlan_assert_inactive_v6
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +
> +	ip link set dev br0 type bridge mcast_vlan_snooping 0
> +	ip link set dev br0 type bridge vlan_filtering 0
> +	test_active_own_querier_nolog
> +	ip link set dev br0 type bridge vlan_filtering 1
> +	mcast_assert_active_v4
> +	mcast_assert_active_v6
> +
> +	ip link set dev br0 type bridge mcast_vlan_snooping 1
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +
> +	test_inactive_nolog
> +	log_test "Mcast VLAN inactive test"
> +}
> +
> +test_vlan_active_final()
> +{
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +	mcast_vlan_assert_active_v4
> +	mcast_vlan_assert_active_v6
> +
> +	test_vlan_teardown
> +	test_inactive_nolog
> +}
> +
> +test_vlan_active_other_querier()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  ""
> +	test_vlan_active_setup_config "1" "2" "3" "4"
> +	test_vlan_active_setup_add_vlan
> +	test_vlan_active_setup_config_vlan "1" "2" ""  "4"
> +	test_active_setup_wait
> +	test_vlan_active_final
> +
> +	log_test "Mcast VLAN active, other querier test"
> +}
> +
> +test_vlan_active_own_querier()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  ""
> +	test_vlan_active_setup_config "1" "2" "3" "4"
> +	test_vlan_active_setup_add_vlan
> +	test_vlan_active_setup_config_vlan "1" "2" "3" ""
> +	test_active_setup_wait
> +	test_vlan_active_final
> +
> +	log_test "Mcast VLAN active, own querier test"
> +}
> +
> +test_vlan_inactive_brdown()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" ""  "4" ""  "6"
> +	test_active_setup_config "1" "2" ""  ""
> +	test_vlan_active_setup_config "1" "2" "3" "4"
> +	test_vlan_active_setup_add_vlan
> +	test_vlan_active_setup_config_vlan "1" "2" "3" ""
> +	test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +	mcast_vlan_assert_inactive_v4
> +	mcast_vlan_assert_inactive_v6
> +
> +	test_active_setup_bridge ""  ""  "3" ""  ""  ""
> +	mcast_vlan_assert_active_v4
> +	mcast_assert_inactive_v6
> +
> +	test_active_setup_bridge ""  ""  ""  ""  "5" ""
> +	test_vlan_active_setup_reset_own_querier
> +	test_vlan_active_final
> +
> +	log_test "Mcast VLAN inactive, bridge down test"
> +}
> +
> +test_vlan_inactive_nov6()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge ""  "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  ""
> +	test_vlan_active_setup_config "1" "2" "3" "4"
> +	test_vlan_active_setup_add_vlan
> +	test_vlan_active_setup_config_vlan "1" "2" "3" ""
> +	test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +	mcast_vlan_assert_active_v4
> +	mcast_vlan_assert_inactive_v6
> +
> +	test_active_setup_bridge "1" ""  ""  ""  "5" ""
> +	test_vlan_active_setup_reset_own_querier
> +	test_vlan_active_final
> +
> +	log_test "Mcast VLAN inactive, own querier, no IPv6 address test"
> +}
> +
> +test_vlan_inactive_snooping_off()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config ""  "2" ""  ""
> +	test_vlan_active_setup_config "1" "2" "3" "4"
> +	test_vlan_active_setup_add_vlan
> +	test_vlan_active_setup_config_vlan "1" "2" "3" ""
> +	test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +	mcast_vlan_assert_inactive_v4
> +	mcast_vlan_assert_inactive_v6
> +
> +	test_active_setup_config "1" ""  ""  ""
> +	test_vlan_active_setup_reset_own_querier
> +	test_vlan_active_final
> +
> +	log_test "Mcast VLAN inactive, snooping disabled test"
> +}
> +
> +test_vlan_inactive_vlans_snooping_off()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  ""
> +	test_vlan_active_setup_config "1" "2" ""  "4"
> +	test_vlan_active_setup_add_vlan
> +	test_vlan_active_setup_config_vlan "1" "2" "3" ""
> +	test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +	mcast_vlan_assert_inactive_v4
> +	mcast_vlan_assert_inactive_v6
> +
> +	test_vlan_active_setup_config ""  ""  "3" ""
> +	test_vlan_active_setup_reset_own_querier
> +	test_vlan_active_final
> +
> +	log_test "Mcast VLAN inactive, snooping for VLANs disabled test"
> +}
> +
> +test_vlan_inactive_vlan_snooping_off()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  ""
> +	test_vlan_active_setup_config "1" "2" "3" "4"
> +	test_vlan_active_setup_add_vlan
> +	test_vlan_active_setup_config_vlan ""  "2" "3" ""
> +	test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +	mcast_vlan_assert_inactive_v4
> +	mcast_vlan_assert_inactive_v6
> +
> +	test_vlan_active_setup_config_vlan "1" ""  ""  ""
> +	test_vlan_active_setup_reset_own_querier
> +	test_vlan_active_final
> +
> +	log_test "Mcast VLAN inactive, snooping for this VLAN disabled test"
> +}
> +
> +test_vlan_inactive_other_querier_norespdelay()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  ""
> +	test_vlan_active_setup_config "1" "2" "3" "4"
> +	test_vlan_active_setup_add_vlan
> +	test_vlan_active_setup_config_vlan "1" "2" ""  "4"
> +	#test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +	mcast_vlan_assert_inactive_v4
> +	mcast_vlan_assert_inactive_v6
> +
> +	test_active_setup_wait
> +	test_vlan_active_final
> +
> +	log_test "Mcast VLAN inactive, other querier, no response delay test"
> +}
> +
> +test_vlan_inactive_own_querier_norespdelay()
> +{
> +	RET=0
> +
> +	test_active_setup_bridge "1" "2" "3" "4" "5" "6"
> +	test_active_setup_config "1" "2" ""  ""
> +	test_vlan_active_setup_config "1" "2" "3" "4"
> +	test_vlan_active_setup_add_vlan
> +	test_vlan_active_setup_config_vlan "1" "2" "3" ""
> +	#test_active_setup_wait
> +
> +	mcast_assert_inactive_v4
> +	mcast_assert_inactive_v6
> +	mcast_vlan_assert_inactive_v4
> +	mcast_vlan_assert_inactive_v6
> +
> +	test_active_setup_wait
> +	test_vlan_active_final
> +
> +	log_test "Mcast VLAN inactive, own querier, no response delay test"
> +}
> +

The test should check if the new attributes are supported by iproute2
and skip if they are not present. See bridge_activity_notify.sh, for
example.

> +trap cleanup EXIT
> +
> +setup_prepare
> +setup_wait
> +
> +tests_run
> +
> +exit $EXIT_STATUS
> -- 
> 2.51.0
> 

^ permalink raw reply

* Re: [PATCH net-next v2 01/14] net: bridge: mcast: export ip{4,6}_active state to netlink
From: Ido Schimmel @ 2026-02-08 16:00 UTC (permalink / raw)
  To: Linus Lüssing
  Cc: bridge, netdev, linux-kernel, Nikolay Aleksandrov, Andrew Lunn,
	Simon Horman, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S . Miller, Kuniyuki Iwashima, Stanislav Fomichev,
	Xiao Liang
In-Reply-To: <20260206030123.5430-2-linus.luessing@c0d3.blue>

On Fri, Feb 06, 2026 at 03:52:07AM +0100, Linus Lüssing wrote:
> Export the new ip{4,6}_active variables to netlink, to be able to
> check from userspace that they are updated as intended.
> 
> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>

^ permalink raw reply

* Re: [PATCH v2 v5.10.y 0/5] Backport fixes for CVE-2025-40149
From: Greg KH @ 2026-02-08 15:06 UTC (permalink / raw)
  To: Keerthana K
  Cc: stable, j.vosburgh, vfalico, andy, davem, kuba, kuznet, yoshfuji,
	borisp, aviadye, john.fastabend, daniel, ast, andrii, kafai,
	songliubraving, yhs, kpsingh, carlos.soto, simon.horman,
	luca.czesla, felix.huettner, ilyal, netdev, bpf, ajay.kaher,
	alexey.makhalov, vamsi-krishna.brahmajosyula, yin.ding,
	tapas.kundu
In-Reply-To: <20260119092602.1414468-1-keerthana.kalyanasundaram@broadcom.com>

On Mon, Jan 19, 2026 at 09:25:57AM +0000, Keerthana K wrote:
> Following commits are pre-requisite for the commit c65f27b9c
> - 1dbf1d590 (net: Add locking to protect skb->dev access in ip_output)
> - 5b9985454 (net/bonding: Take IP hash logic into a helper)
> - 007feb87f (net/bonding: Implement ndo_sk_get_lower_dev)
> - 719a402cf (net: netdevice: Add operation ndo_sk_get_lower_dev)
> 
> Kuniyuki Iwashima (1):
>   tls: Use __sk_dst_get() and dst_dev_rcu() in get_netdev_for_sock().
> 
> Sharath Chandra Vurukala (1):
>   net: Add locking to protect skb->dev access in ip_output
> 
> Tariq Toukan (3):
>   net/bonding: Take IP hash logic into a helper
>   net/bonding: Implement ndo_sk_get_lower_dev
>   net: netdevice: Add operation ndo_sk_get_lower_dev
> 
>  drivers/net/bonding/bond_main.c | 109 ++++++++++++++++++++++++++++++--
>  include/linux/netdevice.h       |   4 ++
>  include/net/bonding.h           |   2 +
>  include/net/dst.h               |  12 ++++
>  net/core/dev.c                  |  33 ++++++++++
>  net/ipv4/ip_output.c            |  16 +++--
>  net/tls/tls_device.c            |  18 +++---
>  7 files changed, 176 insertions(+), 18 deletions(-)
> 
> -- 
> 2.43.7
> 
> 

What changed from v1?

^ permalink raw reply

* Re: [PATCH v2 3/5] net/bonding: Implement ndo_sk_get_lower_dev
From: Greg KH @ 2026-02-08 15:05 UTC (permalink / raw)
  To: Keerthana K
  Cc: stable, j.vosburgh, vfalico, andy, davem, kuba, kuznet, yoshfuji,
	borisp, aviadye, john.fastabend, daniel, ast, andrii, kafai,
	songliubraving, yhs, kpsingh, carlos.soto, simon.horman,
	luca.czesla, felix.huettner, ilyal, netdev, bpf, ajay.kaher,
	alexey.makhalov, vamsi-krishna.brahmajosyula, yin.ding,
	tapas.kundu, Tariq Toukan
In-Reply-To: <20260119092602.1414468-4-keerthana.kalyanasundaram@broadcom.com>

On Mon, Jan 19, 2026 at 09:26:00AM +0000, Keerthana K wrote:
> From: Tariq Toukan <tariqt@nvidia.com>
> 
> [ Upstream commit 719a402cf60311b1cdff3f6320abaecdcc5e46b7 ]

This commit id does not match this patch at all.

Please fix this up and test the series properly and resend it if you
wish to have it applied.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v5.10-v5.15 ] ipv6: use RCU in ip6_xmit()
From: Greg KH @ 2026-02-08 15:00 UTC (permalink / raw)
  To: Keerthana K
  Cc: stable, davem, yoshfuji, dsahern, edumazet, kuba, pabeni, kafai,
	weiwan, netdev, linux-kernel, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu, Sasha Levin,
	Shivani Agarwal
In-Reply-To: <20260205074644.2091266-1-keerthana.kalyanasundaram@broadcom.com>

On Thu, Feb 05, 2026 at 07:46:44AM +0000, Keerthana K wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> [ Upstream commit 9085e56501d93af9f2d7bd16f7fcfacdde47b99c ]
> 
> Use RCU in ip6_xmit() in order to use dst_dev_rcu() to prevent
> possible UAF.
> 
> Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: David Ahern <dsahern@kernel.org>
> Link: https://patch.msgid.link/20250828195823.3958522-4-edumazet@google.com
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> Signed-off-by: Keerthana K <keerthana.kalyanasundaram@broadcom.com>
> Signed-off-by: Shivani Agarwal <shivani.agarwal@broadcom.com>
> ---
>  net/ipv6/ip6_output.c | 35 +++++++++++++++++++++--------------
>  1 file changed, 21 insertions(+), 14 deletions(-)

We need working versions for newer kernels first.  Please resend this
when you have submitted patches for the newer releases.

thanks,

greg k-h

^ permalink raw reply

* [PATCH] selftests/net: add test for IPv4-in-IPv6 tunneling
From: Linus Heckemann @ 2026-02-08 14:46 UTC (permalink / raw)
  To: edumazet
  Cc: davem, eric.dumazet, horms, kuba, morikw2, netdev, pabeni,
	syzbot+d4dda070f833dc5dc89a, Linus Heckemann

81c734dae203757fb3c9eee6f9896386940776bd was fine in and of itself, but
its backport to 6.12 (and 6.6) broke IPv4-in-IPv6 tunneling, see [1].
This adds a self-test for basic IPv4-in-IPv6 functionality.

[1]: https://lore.kernel.org/all/CAA2RiuSnH_2xc+-W6EnFEG00XjS-dszMq61JEvRjcGS31CBw=g@mail.gmail.com/
---
 tools/testing/selftests/net/Makefile      |  1 +
 tools/testing/selftests/net/ip6_tunnel.sh | 41 +++++++++++++++++++++++
 2 files changed, 42 insertions(+)
 create mode 100644 tools/testing/selftests/net/ip6_tunnel.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 45c4ea381bc36..5037a344ad826 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -43,6 +43,7 @@ TEST_PROGS := \
 	io_uring_zerocopy_tx.sh \
 	ioam6.sh \
 	ip6_gre_headroom.sh \
+	ip6_tunnel.sh \
 	ip_defrag.sh \
 	ip_local_port_range.sh \
 	ipv6_flowlabel.sh \
diff --git a/tools/testing/selftests/net/ip6_tunnel.sh b/tools/testing/selftests/net/ip6_tunnel.sh
new file mode 100644
index 0000000000000..366f4c06cd6a3
--- /dev/null
+++ b/tools/testing/selftests/net/ip6_tunnel.sh
@@ -0,0 +1,41 @@
+#!/bin/bash
+# Test that IPv4-over-IPv6 tunneling works.
+
+set -e
+
+setup_prepare() {
+  ip link add transport1 type veth peer name transport2
+
+  ip netns add ns1
+  ip link set transport1 netns ns1
+  ip netns exec ns1 bash <<EOF
+  set -e
+  ip address add 2001:db8::1/64 dev transport1 nodad
+  ip link set transport1 up
+  ip link add link transport1 name tunnel1 type ip6tnl mode ipip6 local 2001:db8::1 remote 2001:db8::2
+  ip address add 172.0.0.1/32 peer 172.0.0.2/32 dev tunnel1
+  ip link set tunnel1 up
+EOF
+
+  ip netns add ns2
+  ip link set transport2 netns ns2
+  ip netns exec ns2 bash <<EOF
+  set -e
+  ip address add 2001:db8::2/64 dev transport2 nodad
+  ip link set transport2 up
+  ip link add link transport2 name tunnel2 type ip6tnl mode ipip6 local 2001:db8::2 remote 2001:db8::1
+  ip address add 172.0.0.2/32 peer 172.0.0.1/32 dev tunnel2
+  ip link set tunnel2 up
+EOF
+}
+
+cleanup() {
+  ip netns delete ns1
+  ip netns delete ns2
+  # in case the namespaces haven't been set up yet
+  ip link delete transport1
+}
+
+trap cleanup EXIT
+setup_prepare
+ip netns exec ns1 ping -W1 -c1 172.0.0.2
-- 
2.52.0


^ permalink raw reply related

* [RFC 3/3] vhost/net: add RX netfilter offload path
From: Cindy Lu @ 2026-02-08 14:32 UTC (permalink / raw)
  To: lulu, mst, jasowang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20260208143441.2177372-1-lulu@redhat.com>

Route RX packets through the netfilter socket when configured.
Key points:
- Add VHOST_NET_FILTER_MAX_LEN upper bound for filter payload size
- Introduce vhost_net_filter_request() to send REQUEST to userspace
- Add handle_rx_filter() fast path for RX when filter is active
- Hook filter path in handle_rx() when filter_sock is set

Signed-off-by: Cindy Lu <lulu@redhat.com>
---
 drivers/vhost/net.c | 229 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 229 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f02deff0e53c..aa9a5ed43eae 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -161,6 +161,13 @@ struct vhost_net {
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
 
+/*
+ * Upper bound for a single packet payload on the filter path.
+ * Keep this large enough for the largest expected frame plus vnet headers,
+ * but still bounded to avoid unbounded allocations.
+ */
+#define VHOST_NET_FILTER_MAX_LEN (4096 + 65536)
+
 static void *vhost_net_buf_get_ptr(struct vhost_net_buf *rxq)
 {
 	if (rxq->tail != rxq->head)
@@ -1227,6 +1234,222 @@ static long vhost_net_set_filter(struct vhost_net *n, int fd)
 	return r;
 }
 
+/*
+ * Send a filter REQUEST message to userspace for a single packet.
+ *
+ * The caller provides a writable buffer; userspace may inspect the content and
+ * optionally modify it in place. We only accept the packet if the returned
+ * length matches the original length, otherwise the packet is dropped.
+ */
+static int vhost_net_filter_request(struct vhost_net *n, u16 direction,
+				    void *buf, u32 *len)
+{
+	struct vhost_net_filter_msg msg = {
+		.type = VHOST_NET_FILTER_MSG_REQUEST,
+		.direction = direction,
+		.len = *len,
+	};
+	struct msghdr msghdr = { 0 };
+	struct kvec iov[2] = {
+		{ .iov_base = &msg, .iov_len = sizeof(msg) },
+		{ .iov_base = buf, .iov_len = *len },
+	};
+	struct socket *sock;
+	struct file *sock_file = NULL;
+	int ret;
+
+	/*
+	 * Take a temporary file reference to guard against concurrent
+	 * filter socket replacement while we send the message.
+	 */
+	spin_lock(&n->filter_lock);
+	sock = n->filter_sock;
+	if (sock)
+		sock_file = get_file(sock->file);
+	spin_unlock(&n->filter_lock);
+
+	if (!sock) {
+		ret = -ENOTCONN;
+		goto out_put;
+	}
+
+	ret = kernel_sendmsg(sock, &msghdr, iov,
+			     *len ? 2 : 1, sizeof(msg) + *len);
+
+out_put:
+	if (sock_file)
+		fput(sock_file);
+
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+/*
+ * RX fast path when filter offload is active.
+ *
+ * This mirrors handle_rx() but routes each RX packet through userspace
+ * netfilter. Packets are copied into a temporary buffer, sent to the filter
+ * socket as a REQUEST, and only delivered to the guest if userspace keeps the
+ * length unchanged. Any truncation or mismatch drops the packet.
+ */
+static void handle_rx_filter(struct vhost_net *net, struct socket *sock)
+{
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_RX];
+	struct vhost_virtqueue *vq = &nvq->vq;
+	bool in_order = vhost_has_feature(vq, VIRTIO_F_IN_ORDER);
+	unsigned int count = 0;
+	unsigned int in, log;
+	struct vhost_log *vq_log;
+	struct virtio_net_hdr hdr = {
+		.flags = 0,
+		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_control = NULL,
+		.msg_controllen = 0,
+		.msg_flags = MSG_DONTWAIT,
+	};
+	size_t total_len = 0;
+	int mergeable;
+	bool set_num_buffers;
+	size_t vhost_hlen, sock_hlen;
+	size_t vhost_len, sock_len;
+	bool busyloop_intr = false;
+	struct iov_iter fixup;
+	__virtio16 num_buffers;
+	int recv_pkts = 0;
+	unsigned int ndesc;
+	void *pkt;
+
+	pkt = kvmalloc(VHOST_NET_FILTER_MAX_LEN, GFP_KERNEL | __GFP_NOWARN);
+	if (!pkt) {
+		vhost_net_enable_vq(net, vq);
+		return;
+	}
+
+	vhost_hlen = nvq->vhost_hlen;
+	sock_hlen = nvq->sock_hlen;
+
+	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
+	set_num_buffers = mergeable || vhost_has_feature(vq, VIRTIO_F_VERSION_1);
+
+	do {
+		u32 pkt_len;
+		int err;
+		s16 headcount;
+		struct kvec iov;
+
+		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
+						      &busyloop_intr, &count);
+		if (!sock_len)
+			break;
+		sock_len += sock_hlen;
+		if (sock_len > VHOST_NET_FILTER_MAX_LEN) {
+			/* Consume and drop oversized packet. */
+			iov.iov_base = pkt;
+			iov.iov_len = 1;
+			kernel_recvmsg(sock, &msg, &iov, 1, 1,
+				       MSG_DONTWAIT | MSG_TRUNC);
+			continue;
+		}
+
+		vhost_len = sock_len + vhost_hlen;
+		headcount = get_rx_bufs(nvq, vq->heads + count,
+					vq->nheads + count, vhost_len, &in,
+					vq_log, &log,
+					likely(mergeable) ? UIO_MAXIOV : 1,
+					&ndesc);
+		if (unlikely(headcount < 0))
+			goto out;
+
+		if (!headcount) {
+			if (unlikely(busyloop_intr)) {
+				vhost_poll_queue(&vq->poll);
+			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+				vhost_disable_notify(&net->dev, vq);
+				continue;
+			}
+			goto out;
+		}
+
+		busyloop_intr = false;
+
+		if (nvq->rx_ring)
+			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
+
+		iov.iov_base = pkt;
+		iov.iov_len = sock_len;
+		err = kernel_recvmsg(sock, &msg, &iov, 1, sock_len,
+				     MSG_DONTWAIT | MSG_TRUNC);
+		if (unlikely(err != sock_len)) {
+			vhost_discard_vq_desc(vq, headcount, ndesc);
+			continue;
+		}
+
+		pkt_len = sock_len;
+		err = vhost_net_filter_request(net, VHOST_NET_FILTER_DIRECTION_TX,
+					       pkt, &pkt_len);
+		if (err < 0)
+			pkt_len = sock_len;
+		if (pkt_len != sock_len) {
+			vhost_discard_vq_desc(vq, headcount, ndesc);
+			continue;
+		}
+
+		iov_iter_init(&msg.msg_iter, ITER_DEST, vq->iov, in, vhost_len);
+		fixup = msg.msg_iter;
+		if (unlikely(vhost_hlen))
+			iov_iter_advance(&msg.msg_iter, vhost_hlen);
+
+		if (copy_to_iter(pkt, sock_len, &msg.msg_iter) != sock_len) {
+			vhost_discard_vq_desc(vq, headcount, ndesc);
+			goto out;
+		}
+
+		if (unlikely(vhost_hlen)) {
+			if (copy_to_iter(&hdr, sizeof(hdr),
+					 &fixup) != sizeof(hdr)) {
+				vhost_discard_vq_desc(vq, headcount, ndesc);
+				goto out;
+			}
+		} else {
+			iov_iter_advance(&fixup, sizeof(hdr));
+		}
+
+		num_buffers = cpu_to_vhost16(vq, headcount);
+		if (likely(set_num_buffers) &&
+		    copy_to_iter(&num_buffers, sizeof(num_buffers), &fixup) !=
+			    sizeof(num_buffers)) {
+			vhost_discard_vq_desc(vq, headcount, ndesc);
+			goto out;
+		}
+
+		nvq->done_idx += headcount;
+		count += in_order ? 1 : headcount;
+		if (nvq->done_idx > VHOST_NET_BATCH) {
+			vhost_net_signal_used(nvq, count);
+			count = 0;
+		}
+
+		if (unlikely(vq_log))
+			vhost_log_write(vq, vq_log, log, vhost_len, vq->iov, in);
+
+		total_len += vhost_len;
+	} while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
+
+	if (unlikely(busyloop_intr))
+		vhost_poll_queue(&vq->poll);
+	else if (!sock_len)
+		vhost_net_enable_vq(net, vq);
+
+out:
+	vhost_net_signal_used(nvq, count);
+	kvfree(pkt);
+}
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
@@ -1281,6 +1504,11 @@ static void handle_rx(struct vhost_net *net)
 	set_num_buffers = mergeable ||
 			  vhost_has_feature(vq, VIRTIO_F_VERSION_1);
 
+	if (READ_ONCE(net->filter_sock)) {
+		handle_rx_filter(net, sock);
+		goto out_unlock;
+	}
+
 	do {
 		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
 						      &busyloop_intr, &count);
@@ -1383,6 +1611,7 @@ static void handle_rx(struct vhost_net *net)
 		vhost_net_enable_vq(net, vq);
 out:
 	vhost_net_signal_used(nvq, count);
+out_unlock:
 	mutex_unlock(&vq->mutex);
 }
 
-- 
2.52.0


^ permalink raw reply related

* [RFC 2/3] vhost/net: add netfilter socket support
From: Cindy Lu @ 2026-02-08 14:32 UTC (permalink / raw)
  To: lulu, mst, jasowang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20260208143441.2177372-1-lulu@redhat.com>

Introduce the netfilter socket plumbing and the VHOST_NET_SET_FILTER ioctl.
Initialize the netfilter state on open and release it on reset/close.

Key points:
- Add filter_sock + filter_lock to vhost_net
- Validate SOCK_SEQPACKET AF_UNIX filter socket from userspace
- Add vhost_net_set_filter() and VHOST_NET_SET_FILTER ioctl handler
- Initialize filter state on open and clean up on reset/release

Signed-off-by: Cindy Lu <lulu@redhat.com>
---
 drivers/vhost/net.c | 109 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7f886d3dba7d..f02deff0e53c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -131,6 +131,7 @@ struct vhost_net_virtqueue {
 	struct vhost_net_buf rxq;
 	/* Batched XDP buffs */
 	struct xdp_buff *xdp;
+
 };
 
 struct vhost_net {
@@ -147,6 +148,15 @@ struct vhost_net {
 	bool tx_flush;
 	/* Private page frag cache */
 	struct page_frag_cache pf_cache;
+
+	/*
+	 * Optional vhost-net filter offload socket.
+	 * When configured, RX packets can be routed through a userspace
+	 * filter chain via a SOCK_SEQPACKET control socket. Access to
+	 * filter_sock is protected by filter_lock.
+	 */
+	struct socket *filter_sock;
+	spinlock_t filter_lock;
 };
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
@@ -1128,6 +1138,95 @@ static int get_rx_bufs(struct vhost_net_virtqueue *nvq,
 	return r;
 }
 
+/*
+ * Validate and acquire the filter socket from userspace.
+ *
+ * Returns:
+ *   - NULL when fd == -1 (explicitly disable filter)
+ *   - a ref-counted struct socket on success
+ *   - ERR_PTR(-errno) on validation failure
+ */
+static struct socket *get_filter_socket(int fd)
+{
+	int r;
+	struct socket *sock;
+
+	/* Special case: userspace asks to disable filter. */
+	if (fd == -1)
+		return NULL;
+
+	sock = sockfd_lookup(fd, &r);
+	if (!sock)
+		return ERR_PTR(-ENOTSOCK);
+
+	if (sock->sk->sk_family != AF_UNIX ||
+	    sock->sk->sk_type != SOCK_SEQPACKET) {
+		sockfd_put(sock);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return sock;
+}
+
+/*
+ * Drop the currently configured filter socket, if any.
+ *
+ * Caller does not need to hold filter_lock; this function clears the pointer
+ * under the lock and releases the socket reference afterwards.
+ */
+static void vhost_net_filter_stop(struct vhost_net *n)
+{
+	struct socket *sock = n->filter_sock;
+
+	spin_lock(&n->filter_lock);
+	n->filter_sock = NULL;
+	spin_unlock(&n->filter_lock);
+
+	if (sock)
+		sockfd_put(sock);
+}
+
+/*
+ * Install or remove a filter socket for this vhost-net device.
+ *
+ * The ioctl passes an fd for a SOCK_SEQPACKET AF_UNIX socket created by
+ * userspace. We validate the socket type, replace any existing filter socket,
+ * and keep a reference so RX path can safely send filter requests.
+ */
+static long vhost_net_set_filter(struct vhost_net *n, int fd)
+{
+	struct socket *sock;
+	int r;
+
+	mutex_lock(&n->dev.mutex);
+	r = vhost_dev_check_owner(&n->dev);
+	if (r)
+		goto out;
+
+	sock = get_filter_socket(fd);
+	if (IS_ERR(sock)) {
+		r = PTR_ERR(sock);
+		goto out;
+	}
+
+	vhost_net_filter_stop(n);
+
+	if (!sock) {
+		r = 0;
+		goto out;
+	}
+
+	spin_lock(&n->filter_lock);
+	n->filter_sock = sock;
+	spin_unlock(&n->filter_lock);
+
+	r = 0;
+
+out:
+	mutex_unlock(&n->dev.mutex);
+	return r;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
@@ -1383,6 +1482,8 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 
 	f->private_data = n;
 	page_frag_cache_init(&n->pf_cache);
+	spin_lock_init(&n->filter_lock);
+	n->filter_sock = NULL;
 
 	return 0;
 }
@@ -1433,6 +1534,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	struct socket *tx_sock;
 	struct socket *rx_sock;
 
+	vhost_net_filter_stop(n);
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
 	vhost_dev_stop(&n->dev);
@@ -1637,6 +1739,8 @@ static long vhost_net_reset_owner(struct vhost_net *n)
 	err = vhost_dev_check_owner(&n->dev);
 	if (err)
 		goto done;
+
+	vhost_net_filter_stop(n);
 	umem = vhost_dev_reset_owner_prepare();
 	if (!umem) {
 		err = -ENOMEM;
@@ -1737,6 +1841,7 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 	void __user *argp = (void __user *)arg;
 	u64 __user *featurep = argp;
 	struct vhost_vring_file backend;
+	struct vhost_net_filter filter;
 	u64 features, count, copied;
 	int r, i;
 
@@ -1745,6 +1850,10 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 		if (copy_from_user(&backend, argp, sizeof backend))
 			return -EFAULT;
 		return vhost_net_set_backend(n, backend.index, backend.fd);
+	case VHOST_NET_SET_FILTER:
+		if (copy_from_user(&filter, argp, sizeof(filter)))
+			return -EFAULT;
+		return vhost_net_set_filter(n, filter.fd);
 	case VHOST_GET_FEATURES:
 		features = vhost_net_features[0];
 		if (copy_to_user(featurep, &features, sizeof features))
-- 
2.52.0


^ permalink raw reply related

* [RFC 1/3] uapi: vhost: add vhost-net netfilter offload API
From: Cindy Lu @ 2026-02-08 14:32 UTC (permalink / raw)
  To: lulu, mst, jasowang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20260208143441.2177372-1-lulu@redhat.com>

Add VHOST_NET_SET_FILTER ioctl and the filter socket protocol used for
vhost-net filter offload.

Signed-off-by: Cindy Lu <lulu@redhat.com>
---
 include/uapi/linux/vhost.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index c57674a6aa0d..d9a0ca7a3df0 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -131,6 +131,26 @@
  * device.  This can be used to stop the ring (e.g. for migration). */
 #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
 
+/* VHOST_NET filter offload (kernel vhost-net dataplane through QEMU netfilter) */
+struct vhost_net_filter {
+	__s32 fd;
+};
+
+enum {
+	VHOST_NET_FILTER_MSG_REQUEST = 1,
+};
+
+#define VHOST_NET_FILTER_DIRECTION_TX 1
+
+struct vhost_net_filter_msg {
+	__u16 type;
+	__u16 direction;
+	__u32 len;
+};
+
+
+#define VHOST_NET_SET_FILTER _IOW(VHOST_VIRTIO, 0x31, struct vhost_net_filter)
+
 /* VHOST_SCSI specific defines */
 
 #define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct vhost_scsi_target)
-- 
2.52.0


^ permalink raw reply related

* [RFC 0/3]  vhost-net: netfilter support for RX path
From: Cindy Lu @ 2026-02-08 14:32 UTC (permalink / raw)
  To: lulu, mst, jasowang, kvm, virtualization, netdev, linux-kernel

This series adds minimal vhost-net filter support for the RX path.
It introduces a UAPI for VHOST_NET_SET_FILTER and a simple
SOCK_SEQPACKET message header. The kernel side keeps a filter
socket reference and routes RX packets to userspace when the
filter is enabled.

Tested
- vhost=on  and vhost=off

Cindy Lu (3):
  uapi: vhost: add vhost-net netfilter offload API
  vhost/net: add netfilter socket support
  vhost/net: add RX netfilter offload path

 drivers/vhost/net.c        | 338 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vhost.h |  20 +++
 2 files changed, 358 insertions(+)

-- 
2.52.0


^ permalink raw reply

* Re: [PATCH net-next v20 00/12] virtio_net: Add ethtool flow rules support
From: Dan Jurgens @ 2026-02-08 13:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, jasowang, pabeni, virtualization, parav, shshitrit,
	yohadt, xuanzhuo, eperezma, jgg, kevin.tian, kuba, andrew+netdev,
	edumazet
In-Reply-To: <20260208065456-mutt-send-email-mst@kernel.org>

On 2/8/26 5:55 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 05, 2026 at 04:46:55PM -0600, Daniel Jurgens wrote:
>> v15:
>>   - In virtnet_restore_up only call virtnet_close in err path if
>>     netif_running. AI
> 
> what was this AI specifically?
> 

It was the AI review bot, forwarded by Jakub on v16:

> +              * remove_vq_common resets the device and frees the vqs.
> +              */
> +             vi->rx_mode_work_enabled = false;
> +             rtnl_unlock();
> +             remove_vq_common(vi);
> +             return err;

If virtnet_ff_init() fails here, remove_vq_common() frees vi->rq, vi->sq,
and vi->ctrl via virtnet_free_queues(), but the netdevice remains
registered. Could this leave the device in an inconsistent state where
subsequent operations (like virtnet_open() triggered by bringing the
interface up) would access freed memory through vi->rq[i]?

The error return propagates up to virtnet_restore() which just returns
the error without further cleanup. If userspace then tries to use the
still-registered netdevice, virtnet_open() would call try_fill_recv()
which dereferences vi->rq.

> +     }
> +     rtnl_unlock();
> +
>        netif_tx_lock_bh(vi->dev);
>        netif_device_attach(vi->dev);
>        netif_tx_unlock_bh(vi->dev);
> -     return err;
> +     return 0;
>  }
--
pw-bot: cr

^ permalink raw reply

* Re: [PATCH v2] net: mscc: ocelot: add missing lock protection in ocelot_port_xmit()
From: Vladimir Oltean @ 2026-02-08 13:35 UTC (permalink / raw)
  To: Ziyi Guo
  Cc: Paolo Abeni, Claudiu Manoil, Alexandre Belloni, UNGLinuxDriver,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	netdev, linux-kernel
In-Reply-To: <CAMFT1=Y=FbjyqC6vr6fV74t8AmGVcecKw+YjX6V_iA7XgZtotA@mail.gmail.com>

On Wed, Feb 04, 2026 at 11:49:49PM -0600, Ziyi Guo wrote:
> On Tue, Feb 3, 2026 at 5:41 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > I'm under the impression this static_branch_unlikely() usage is racy,
> > i.e. on CONFIG_PREEMPT kernel execution could enter this branch but not
> > the paired lock.
> >
> > What about moving the 'Check timestamp' block in a separate helper and
> > use a single static_branch_unlikely() branch? something alike the
> > following, completely untested and unfinished:
> >
> >         if (!static_branch_unlikely(&ocelot_fdma_enabled)) {
> >                 int ret = NETDEV_TX_OK;
> >
> >                 ocelot_lock_inj_grp(ocelot, 0);
> >
> >                 if (!ocelot_can_inject(ocelot, 0)) {
> >                         ret = NETDEV_TX_BUSY;
> >                         goto unlock;
> >                 }
> >
> >                 if (!ocelot_timestamp_check())
> >                         goto unlock;
> >
> >
> >                 ocelot_port_inject_frame(ocelot, port, 0, rew_op, skb);
> >                 consume_skb(skb);
> > unlock:
> >                 ocelot_unlock_inj_grp(ocelot, 0);
> >                 return ret;
> >         }
> >
> >         if (!ocelot_timestamp_check())
> >                 return NETDEV_TX_OK;
> >         ocelot_fdma_inject_frame(ocelot, port, rew_op, skb, dev);
> >         return NETDEV_TX_OK;
> >
> > Well, after scratching the above, I noted it would probably better to
> > invert the two branches...
> 
> 
> Hi Paolo,
> 
> Thank you very much for your review and comments!
> 
> How about we use a new separate helper function like this for previous
> 'Check timestamp' block:
> 
> ```
>   static bool ocelot_xmit_timestamp(struct ocelot *ocelot, int port,
>                                    struct sk_buff *skb, u32 *rew_op)
>   {
>         if (ocelot->ptp && (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
>                 struct sk_buff *clone = NULL;
> 
>                 if (ocelot_port_txtstamp_request(ocelot, port, skb, &clone)) {
>                         kfree_skb(skb);
>                         return false;
>                 }
> 
>                 if (clone)
>                         OCELOT_SKB_CB(skb)->clone = clone;
> 
>                 *rew_op = ocelot_ptp_rew_op(skb);
>         }
> 
>         return true;
>   }
> ```
> 
> So for the function ocelot_port_xmit()
> 
> it will be:
> 
> ```
>   static netdev_tx_t ocelot_port_xmit(struct sk_buff *skb, struct
> net_device *dev)
>   {
>         struct ocelot_port_private *priv = netdev_priv(dev);
>         struct ocelot_port *ocelot_port = &priv->port;
>         struct ocelot *ocelot = ocelot_port->ocelot;
>         int port = priv->port.index;
>         u32 rew_op = 0;
> 
>         /* FDMA path: uses its own locking, handle separately */
>         if (static_branch_unlikely(&ocelot_fdma_enabled)) {
>                 if (!ocelot_xmit_timestamp(ocelot, port, skb, &rew_op))
>                         return NETDEV_TX_OK;
> 
>                 ocelot_fdma_inject_frame(ocelot, port, rew_op, skb, dev);
>                 return NETDEV_TX_OK;
>         }
> 
>         /* Register injection path: needs inj_lock held throughout */
>         ocelot_lock_inj_grp(ocelot, 0);
> 
>         if (!ocelot_can_inject(ocelot, 0)) {
>                 ocelot_unlock_inj_grp(ocelot, 0);
>                 return NETDEV_TX_BUSY;
>         }
> 
>         if (!ocelot_xmit_timestamp(ocelot, port, skb, &rew_op)) {
>                 ocelot_unlock_inj_grp(ocelot, 0);
>                 return NETDEV_TX_OK;
>         }
> 
>         ocelot_port_inject_frame(ocelot, port, 0, rew_op, skb);
> 
>         ocelot_unlock_inj_grp(ocelot, 0);
> 
>         consume_skb(skb);
> 
>         return NETDEV_TX_OK;
>   }
> ```
> 
> Feel free to let me know your thoughts!
> 
> I can send a v3 version patch once we're aligned.

The idea is not bad, but I would move one step further.

Refactor the rew_op handling into an ocelot_xmit_timestamp() function as
a first preparatory patch. The logic will need to be called from two
places and it's good not to duplicate it.

Then create two separate ocelot_port_xmit_fdma() and ocelot_port_xmit_inj(),
as a second preparatory patch.

static netdev_tx_t ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
{
	if (static_branch_unlikely(&ocelot_fdma_enabled))
		return ocelot_port_xmit_fdma(skb, dev);

	return ocelot_port_xmit_inj(skb, dev);
}

Now, as the third patch, add the required locking in ocelot_port_xmit_inj().

It's best for the FDMA vs register injection code paths to be as
separate as possible.

^ permalink raw reply

* Re: [net-next] net: ethernet: ravb: Disable interrupts when closing device
From: Sergey Shtylyov @ 2026-02-08 13:17 UTC (permalink / raw)
  To: Niklas Söderlund, Yoshihiro Shimoda, Paul Barker,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-renesas-soc
In-Reply-To: <20260207184328.2427679-1-niklas.soderlund+renesas@ragnatech.se>

On 2/7/26 9:43 PM, Niklas Söderlund wrote:

> From: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> 
> Disable interrupts when closing the device.

   I think you need to be more specific here: the E-MAC interrupts are
being disabled.

> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
> [Niklas: Rebase from BSP and reword commit message]
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
> ---
>  drivers/net/ethernet/renesas/ravb_main.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index 57b0db314fb5..d56b71003585 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -2368,6 +2368,7 @@ static int ravb_close(struct net_device *ndev)
>  	ravb_write(ndev, 0, RIC0);
>  	ravb_write(ndev, 0, RIC2);
>  	ravb_write(ndev, 0, TIC);
> +	ravb_write(ndev, 0, ECSIPR);
>  
>  	/* PHY disconnect */
>  	if (ndev->phydev) {

MBR, Sergey


^ permalink raw reply

* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
From: Vladimir Oltean @ 2026-02-08 12:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Larysa Zaremba, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
	Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
	Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
	Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
	Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
	Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
	intel-wired-lan, linux-kselftest, Aleksandr Loktionov
In-Reply-To: <20260205175408.30ab72a1@kernel.org>

On Thu, Feb 05, 2026 at 05:54:08PM -0800, Jakub Kicinski wrote:
> FWIW my feeling is that instead of nickel and diming leftover space 
> in the frags if someone actually cared about growing mbufs we should
> have the helper allocate a new page from the PP and append it to the
> shinfo. Much simpler, "infinite space", and works regardless of the
> driver. I don't mean that to suggest you implement it, purely to point
> out that I think nobody really uses positive offsets.. So we can as
> well switch more complicated drivers back to xdp_rxq_info_reg().

FWIW, I do have a use case at least in the theoretical sense for
bpf_xdp_adjust_tail() with positive offsets, although it's still under
development.

I'm working on a DSA data path library for XDP, and one of the features
it supports is redirecting from one user port to another, with in-place
tag modification.

If the path to the egress port goes through a tail-tagging switch but
the path from the ingress port didn't, bpf_xdp_adjust_tail() with a
positive offset will be called to make space for the tail tags.

I'm not sure about the "regardless of the driver" part of your comment.
Is it possible to mix and match allocation models and still keep track
of how each individual page needs to be freed? AFAICS in xdp_return_frame(),
the mem_type is assumed to be the same for the entire xdp_frame.

^ permalink raw reply

* Re: [PATCH net-next v20 00/12] virtio_net: Add ethtool flow rules support
From: Michael S. Tsirkin @ 2026-02-08 11:55 UTC (permalink / raw)
  To: Daniel Jurgens
  Cc: netdev, jasowang, pabeni, virtualization, parav, shshitrit,
	yohadt, xuanzhuo, eperezma, jgg, kevin.tian, kuba, andrew+netdev,
	edumazet
In-Reply-To: <20260205224707.16995-1-danielj@nvidia.com>

On Thu, Feb 05, 2026 at 04:46:55PM -0600, Daniel Jurgens wrote:
> v15:
>   - In virtnet_restore_up only call virtnet_close in err path if
>     netif_running. AI

what was this AI specifically?


^ permalink raw reply

* Re: [PATCH net-next v20 07/12] virtio_net: Implement layer 2 ethtool flow rules
From: Michael S. Tsirkin @ 2026-02-08 11:54 UTC (permalink / raw)
  To: Daniel Jurgens
  Cc: netdev, jasowang, pabeni, virtualization, parav, shshitrit,
	yohadt, xuanzhuo, eperezma, jgg, kevin.tian, kuba, andrew+netdev,
	edumazet
In-Reply-To: <20260205224707.16995-8-danielj@nvidia.com>

On Thu, Feb 05, 2026 at 04:47:02PM -0600, Daniel Jurgens wrote:
> @@ -295,8 +301,16 @@ struct virtnet_ff {
>  	struct virtio_net_ff_cap_data *ff_caps;
>  	struct virtio_net_ff_cap_mask_data *ff_mask;
>  	struct virtio_net_ff_actions *ff_actions;
> +	struct xarray classifiers;
> +	int num_classifiers;
> +	struct virtnet_ethtool_ff ethtool;
>  };
>  
> +static int virtnet_ethtool_flow_insert(struct virtnet_ff *ff,
> +				       struct ethtool_rx_flow_spec *fs,
> +				       u16 curr_queue_pairs);
> +static int virtnet_ethtool_flow_remove(struct virtnet_ff *ff, int location);
> +
>  #define VIRTNET_Q_TYPE_RX 0
>  #define VIRTNET_Q_TYPE_TX 1
>  #define VIRTNET_Q_TYPE_CQ 2


btw reordering code so we do not need forward declarations would make
review a tiny bit easier.


-- 
MST


^ permalink raw reply

* Re: [PATCH net-next v20 05/12] virtio_net: Query and set flow filter caps
From: Michael S. Tsirkin @ 2026-02-08 11:51 UTC (permalink / raw)
  To: Daniel Jurgens
  Cc: netdev, jasowang, pabeni, virtualization, parav, shshitrit,
	yohadt, xuanzhuo, eperezma, jgg, kevin.tian, kuba, andrew+netdev,
	edumazet
In-Reply-To: <20260205224707.16995-6-danielj@nvidia.com>

On Thu, Feb 05, 2026 at 04:47:00PM -0600, Daniel Jurgens wrote:
> When probing a virtnet device, attempt to read the flow filter
> capabilities. In order to use the feature the caps must also
> be set. For now setting what was read is sufficient.
> 
> This patch adds uapi definitions virtio_net flow filters define in
> version 1.4 of the VirtIO spec.
> 
> Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
> Reviewed-by: Parav Pandit <parav@nvidia.com>
> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
> 
> ---
> v4:
>     - Validate the length in the selector caps
>     - Removed __free usage.
>     - Removed for(int.
> v5:
>     - Remove unneed () after MAX_SEL_LEN macro (test bot)
> v6:
>     - Fix sparse warning "array of flexible structures" Jakub K/Simon H
>     - Use new variable and validate ff_mask_size before set_cap. MST
> v7:
>     - Set ff->ff_{caps, mask, actions} NULL in error path. Paolo Abeni
>     - Return errors from virtnet_ff_init, -ENOTSUPP is not fatal. Xuan
> 
> v8:
>     - Use real_ff_mask_size when setting the selector caps. Jason Wang
> 
> v9:
>     - Set err after failed memory allocations. Simon Horman
> 
> v10:
>     - Return -EOPNOTSUPP in virnet_ff_init before allocing any memory.
>       Jason/Paolo.
> 
> v11:
>     - Return -EINVAL if any resource limit is 0. Simon Horman
>     - Ensure we don't overrun alloced space of ff->ff_mask by moving the
>       real_ff_mask_size > ff_mask_size check into the loop. Simon Horman
> 
> v12:
>     - Move uapi includes to virtio_net.c vs header file. MST
>     - Remove kernel.h header in virtio_net_ff uapi. MST
>     - WARN_ON_ONCE in error paths validating selectors. MST
>     - Move includes from .h to .c files. MST
>     - Add WARN_ON_ONCE if obj_destroy fails. MST
>     - Comment cleanup in virito_net_ff.h uapi. MST
>     - Add 2 byte pad to the end of virtio_net_ff_cap_data.
>       https://lore.kernel.org/virtio-comment/20251119044029-mutt-send-email-mst@kernel.org/T/#m930988a5d3db316c68546d8b61f4b94f6ebda030
>     - Cleanup and reinit in the freeze/restore path. MST
> 
> v13:
>     - Added /* private: */ comment before reserved field. Jakub
>     - Change ff_mask validation to break at unkonwn selector type. This
>       will allow compatability with newer controllers if the types of
>       selectors is expanded. MST
> 
> v14:
>     - Handle err from virtnet_ff_init in virtnet_restore_up. MST
> 
> v15:
>     - In virtnet_restore_up only call virtnet_close in err path if
>       netif_runnig. AI
> 
> v16:
>     - Return 0 from virtnet_restore_up if virtnet_init_ff return not
>       supported. AI
> 
> v17:
>     - During restore freeze_down on error during ff_init. AI
> 
> v18:
>     - Changed selector cap validation to verify size for each type
>       instead of just checking they weren't bigger than max size. AI
>     - Added __count_by attribute to flexible members in uapi. Paolo A
> 
> v19:
>     - Fixed ;; and incorrect plural in comment. AI
> 
> v20:
>     - include uapi/linux/stddef.h for __counted_by. AI

AI has led you astray, sadly (




> ---
>  drivers/net/virtio_net.c           | 231 ++++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_net_ff.h |  91 ++++++++++++
>  2 files changed, 321 insertions(+), 1 deletion(-)
>  create mode 100644 include/uapi/linux/virtio_net_ff.h
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index db88dcaefb20..2cfa37e2f83f 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,11 @@
>  #include <net/netdev_rx_queue.h>
>  #include <net/netdev_queues.h>
>  #include <net/xdp_sock_drv.h>
> +#include <linux/virtio_admin.h>
> +#include <net/ipv6.h>
> +#include <net/ip.h>
> +#include <uapi/linux/virtio_pci.h>
> +#include <uapi/linux/virtio_net_ff.h>
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -281,6 +286,14 @@ static const struct virtnet_stat_desc virtnet_stats_tx_speed_desc_qstat[] = {
>  	VIRTNET_STATS_DESC_TX_QSTAT(speed, ratelimit_packets, hw_drop_ratelimits),
>  };
>  
> +struct virtnet_ff {
> +	struct virtio_device *vdev;
> +	bool ff_supported;
> +	struct virtio_net_ff_cap_data *ff_caps;
> +	struct virtio_net_ff_cap_mask_data *ff_mask;
> +	struct virtio_net_ff_actions *ff_actions;
> +};
> +
>  #define VIRTNET_Q_TYPE_RX 0
>  #define VIRTNET_Q_TYPE_TX 1
>  #define VIRTNET_Q_TYPE_CQ 2
> @@ -488,6 +501,7 @@ struct virtnet_info {
>  	TRAILING_OVERLAP(struct virtio_net_rss_config_trailer, rss_trailer, hash_key_data,
>  		u8 rss_hash_key_data[VIRTIO_NET_RSS_MAX_KEY_SIZE];
>  	);
> +	struct virtnet_ff ff;
>  };
>  static_assert(offsetof(struct virtnet_info, rss_trailer.hash_key_data) ==
>  	      offsetof(struct virtnet_info, rss_hash_key_data));
> @@ -526,6 +540,7 @@ static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
>  					       struct page *page, void *buf,
>  					       int len, int truesize);
>  static void virtnet_xsk_completed(struct send_queue *sq, int num);
> +static void remove_vq_common(struct virtnet_info *vi);
>  
>  enum virtnet_xmit_type {
>  	VIRTNET_XMIT_TYPE_SKB,
> @@ -5684,6 +5699,192 @@ static const struct netdev_stat_ops virtnet_stat_ops = {
>  	.get_base_stats		= virtnet_get_base_stats,
>  };
>  
> +static size_t get_mask_size(u16 type)
> +{
> +	switch (type) {
> +	case VIRTIO_NET_FF_MASK_TYPE_ETH:
> +		return sizeof(struct ethhdr);
> +	case VIRTIO_NET_FF_MASK_TYPE_IPV4:
> +		return sizeof(struct iphdr);
> +	case VIRTIO_NET_FF_MASK_TYPE_IPV6:
> +		return sizeof(struct ipv6hdr);
> +	case VIRTIO_NET_FF_MASK_TYPE_TCP:
> +		return sizeof(struct tcphdr);
> +	case VIRTIO_NET_FF_MASK_TYPE_UDP:
> +		return sizeof(struct udphdr);
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtnet_ff_init(struct virtnet_ff *ff, struct virtio_device *vdev)
> +{
> +	size_t ff_mask_size = sizeof(struct virtio_net_ff_cap_mask_data) +
> +			      sizeof(struct virtio_net_ff_selector) *
> +			      VIRTIO_NET_FF_MASK_TYPE_MAX;
> +	struct virtio_admin_cmd_query_cap_id_result *cap_id_list;
> +	struct virtio_net_ff_selector *sel;
> +	unsigned long sel_types = 0;
> +	size_t real_ff_mask_size;
> +	int err;
> +	int i;
> +
> +	if (!vdev->config->admin_cmd_exec)
> +		return -EOPNOTSUPP;
> +
> +	cap_id_list = kzalloc(sizeof(*cap_id_list), GFP_KERNEL);
> +	if (!cap_id_list)
> +		return -ENOMEM;
> +
> +	err = virtio_admin_cap_id_list_query(vdev, cap_id_list);
> +	if (err)
> +		goto err_cap_list;
> +
> +	if (!(VIRTIO_CAP_IN_LIST(cap_id_list,
> +				 VIRTIO_NET_FF_RESOURCE_CAP) &&
> +	      VIRTIO_CAP_IN_LIST(cap_id_list,
> +				 VIRTIO_NET_FF_SELECTOR_CAP) &&
> +	      VIRTIO_CAP_IN_LIST(cap_id_list,
> +				 VIRTIO_NET_FF_ACTION_CAP))) {
> +		err = -EOPNOTSUPP;
> +		goto err_cap_list;
> +	}
> +
> +	ff->ff_caps = kzalloc(sizeof(*ff->ff_caps), GFP_KERNEL);
> +	if (!ff->ff_caps) {
> +		err = -ENOMEM;
> +		goto err_cap_list;
> +	}
> +
> +	err = virtio_admin_cap_get(vdev,
> +				   VIRTIO_NET_FF_RESOURCE_CAP,
> +				   ff->ff_caps,
> +				   sizeof(*ff->ff_caps));
> +
> +	if (err)
> +		goto err_ff;
> +
> +	if (!ff->ff_caps->groups_limit ||
> +	    !ff->ff_caps->classifiers_limit ||
> +	    !ff->ff_caps->rules_limit ||
> +	    !ff->ff_caps->rules_per_group_limit) {
> +		err = -EINVAL;
> +		goto err_ff;
> +	}
> +
> +	/* VIRTIO_NET_FF_MASK_TYPE start at 1 */
> +	for (i = 1; i <= VIRTIO_NET_FF_MASK_TYPE_MAX; i++)
> +		ff_mask_size += get_mask_size(i);
> +
> +	ff->ff_mask = kzalloc(ff_mask_size, GFP_KERNEL);
> +	if (!ff->ff_mask) {
> +		err = -ENOMEM;
> +		goto err_ff;
> +	}
> +
> +	err = virtio_admin_cap_get(vdev,
> +				   VIRTIO_NET_FF_SELECTOR_CAP,
> +				   ff->ff_mask,
> +				   ff_mask_size);

So ff_mask is from the device and ff_mask->count does not seem to be checked.

If device somehow gains a larger mask down the road, can it not then overflow?
or malicious?


> +
> +	if (err)
> +		goto err_ff_mask;
> +
> +	ff->ff_actions = kzalloc(sizeof(*ff->ff_actions) +
> +					VIRTIO_NET_FF_ACTION_MAX,
> +					GFP_KERNEL);
> +	if (!ff->ff_actions) {
> +		err = -ENOMEM;
> +		goto err_ff_mask;
> +	}
> +
> +	err = virtio_admin_cap_get(vdev,
> +				   VIRTIO_NET_FF_ACTION_CAP,
> +				   ff->ff_actions,
> +				   sizeof(*ff->ff_actions) + VIRTIO_NET_FF_ACTION_MAX);

So ff_actions is from device and ff_actions->count is not checked.

If device gains a ton of actions down the road, can it not then overflow?
or malicious?

> +
> +	if (err)
> +		goto err_ff_action;
> +
> +	err = virtio_admin_cap_set(vdev,
> +				   VIRTIO_NET_FF_RESOURCE_CAP,
> +				   ff->ff_caps,
> +				   sizeof(*ff->ff_caps));
> +	if (err)
> +		goto err_ff_action;
> +
> +	real_ff_mask_size = sizeof(struct virtio_net_ff_cap_mask_data);
> +	sel = (void *)&ff->ff_mask->selectors;
> +
> +	for (i = 0; i < ff->ff_mask->count; i++) {
> +		/* If the selector type is unknown it may indicate the spec
> +		 * has been revised to include new types of selectors
> +		 */
> +		if (sel->type > VIRTIO_NET_FF_MASK_TYPE_MAX)

do you want to check sel->type 0 too?

> +			break;

but count remains unchanged? should we not reduce the count here
so the device knows what the driver can drive?


> +
> +		if (sel->length != get_mask_size(sel->type) ||
> +		    test_and_set_bit(sel->type, &sel_types)) {
> +			WARN_ON_ONCE(true);
> +			err = -EINVAL;
> +			goto err_ff_action;
> +		}
> +		real_ff_mask_size += sizeof(struct virtio_net_ff_selector) + sel->length;
> +		if (real_ff_mask_size > ff_mask_size) {
> +			WARN_ON_ONCE(true);
> +			err = -EINVAL;
> +			goto err_ff_action;
> +		}
> +		sel = (void *)sel + sizeof(*sel) + sel->length;
> +	}
> +
> +	err = virtio_admin_cap_set(vdev,
> +				   VIRTIO_NET_FF_SELECTOR_CAP,
> +				   ff->ff_mask,
> +				   real_ff_mask_size);
> +	if (err)
> +		goto err_ff_action;
> +
> +	err = virtio_admin_cap_set(vdev,
> +				   VIRTIO_NET_FF_ACTION_CAP,
> +				   ff->ff_actions,
> +				   sizeof(*ff->ff_actions) + VIRTIO_NET_FF_ACTION_MAX);
> +	if (err)
> +		goto err_ff_action;
> +
> +	ff->vdev = vdev;
> +	ff->ff_supported = true;
> +
> +	kfree(cap_id_list);
> +
> +	return 0;
> +
> +err_ff_action:
> +	kfree(ff->ff_actions);
> +	ff->ff_actions = NULL;
> +err_ff_mask:
> +	kfree(ff->ff_mask);
> +	ff->ff_mask = NULL;
> +err_ff:
> +	kfree(ff->ff_caps);
> +	ff->ff_caps = NULL;
> +err_cap_list:
> +	kfree(cap_id_list);
> +
> +	return err;
> +}
> +
> +static void virtnet_ff_cleanup(struct virtnet_ff *ff)
> +{
> +	if (!ff->ff_supported)
> +		return;
> +
> +	kfree(ff->ff_actions);
> +	kfree(ff->ff_mask);
> +	kfree(ff->ff_caps);
> +	ff->ff_supported = false;
> +}
> +
>  static void virtnet_freeze_down(struct virtio_device *vdev)
>  {
>  	struct virtnet_info *vi = vdev->priv;
> @@ -5702,6 +5903,10 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
>  	netif_tx_lock_bh(vi->dev);
>  	netif_device_detach(vi->dev);
>  	netif_tx_unlock_bh(vi->dev);
> +
> +	rtnl_lock();
> +	virtnet_ff_cleanup(&vi->ff);
> +	rtnl_unlock();
>  }
>  
>  static int init_vqs(struct virtnet_info *vi);
> @@ -5727,10 +5932,23 @@ static int virtnet_restore_up(struct virtio_device *vdev)
>  			return err;
>  	}
>  
> +	/* Initialize flow filters. Not supported is an acceptable and common
> +	 * return code
> +	 */
> +	rtnl_lock();
> +	err = virtnet_ff_init(&vi->ff, vi->vdev);
> +	if (err && err != -EOPNOTSUPP) {
> +		rtnl_unlock();
> +		virtnet_freeze_down(vi->vdev);
> +		remove_vq_common(vi);
> +		return err;
> +	}
> +	rtnl_unlock();
> +
>  	netif_tx_lock_bh(vi->dev);
>  	netif_device_attach(vi->dev);
>  	netif_tx_unlock_bh(vi->dev);
> -	return err;
> +	return 0;
>  }
>  
>  static int virtnet_set_guest_offloads(struct virtnet_info *vi, u64 offloads)
> @@ -7058,6 +7276,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	}
>  	vi->guest_offloads_capable = vi->guest_offloads;
>  
> +	/* Initialize flow filters. Not supported is an acceptable and common
> +	 * return code
> +	 */
> +	err = virtnet_ff_init(&vi->ff, vi->vdev);
> +	if (err && err != -EOPNOTSUPP) {
> +		rtnl_unlock();
> +		goto free_unregister_netdev;
> +	}
> +
>  	rtnl_unlock();
>  
>  	err = virtnet_cpu_notif_add(vi);
> @@ -7073,6 +7300,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  
>  free_unregister_netdev:
>  	unregister_netdev(dev);
> +	virtnet_ff_cleanup(&vi->ff);
>  free_failover:
>  	net_failover_destroy(vi->failover);
>  free_vqs:
> @@ -7121,6 +7349,7 @@ static void virtnet_remove(struct virtio_device *vdev)
>  	virtnet_free_irq_moder(vi);
>  
>  	unregister_netdev(vi->dev);
> +	virtnet_ff_cleanup(&vi->ff);
>  
>  	net_failover_destroy(vi->failover);
>  
> diff --git a/include/uapi/linux/virtio_net_ff.h b/include/uapi/linux/virtio_net_ff.h
> new file mode 100644
> index 000000000000..552a6b3a8a91
> --- /dev/null
> +++ b/include/uapi/linux/virtio_net_ff.h
> @@ -0,0 +1,91 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
> + *
> + * Header file for virtio_net flow filters
> + */
> +#ifndef _LINUX_VIRTIO_NET_FF_H
> +#define _LINUX_VIRTIO_NET_FF_H
> +
> +#include <linux/types.h>
> +#include <uapi/linux/stddef.h>
> +
> +#define VIRTIO_NET_FF_RESOURCE_CAP 0x800
> +#define VIRTIO_NET_FF_SELECTOR_CAP 0x801
> +#define VIRTIO_NET_FF_ACTION_CAP 0x802
> +
> +/**
> + * struct virtio_net_ff_cap_data - Flow filter resource capability limits
> + * @groups_limit: maximum number of flow filter groups supported by the device
> + * @classifiers_limit: maximum number of classifiers supported by the device
> + * @rules_limit: maximum number of rules supported device-wide across all groups
> + * @rules_per_group_limit: maximum number of rules allowed in a single group
> + * @last_rule_priority: priority value associated with the lowest-priority rule
> + * @selectors_per_classifier_limit: maximum selectors allowed in one classifier
> + */
> +struct virtio_net_ff_cap_data {
> +	__le32 groups_limit;
> +	__le32 classifiers_limit;
> +	__le32 rules_limit;
> +	__le32 rules_per_group_limit;
> +	__u8 last_rule_priority;
> +	__u8 selectors_per_classifier_limit;
> +	/* private: */
> +	__u8 reserved[2];
> +};
> +
> +/**
> + * struct virtio_net_ff_selector - Selector mask descriptor
> + * @type: selector type, one of VIRTIO_NET_FF_MASK_TYPE_* constants
> + * @flags: selector flags, see VIRTIO_NET_FF_MASK_F_* constants
> + * @reserved: must be set to 0 by the driver and ignored by the device
> + * @length: size in bytes of @mask
> + * @reserved1: must be set to 0 by the driver and ignored by the device
> + * @mask: variable-length mask payload for @type, length given by @length
> + *
> + * A selector describes a header mask that a classifier can apply. The format
> + * of @mask depends on @type.
> + */
> +struct virtio_net_ff_selector {
> +	__u8 type;
> +	__u8 flags;
> +	__u8 reserved[2];
> +	__u8 length;
> +	__u8 reserved1[3];
> +	__u8 mask[] __counted_by(length);
> +};
> +
> +#define VIRTIO_NET_FF_MASK_TYPE_ETH  1
> +#define VIRTIO_NET_FF_MASK_TYPE_IPV4 2
> +#define VIRTIO_NET_FF_MASK_TYPE_IPV6 3
> +#define VIRTIO_NET_FF_MASK_TYPE_TCP  4
> +#define VIRTIO_NET_FF_MASK_TYPE_UDP  5
> +#define VIRTIO_NET_FF_MASK_TYPE_MAX  VIRTIO_NET_FF_MASK_TYPE_UDP
> +
> +/**
> + * struct virtio_net_ff_cap_mask_data - Supported selector mask formats
> + * @count: number of entries in @selectors
> + * @reserved: must be set to 0 by the driver and ignored by the device
> + * @selectors: packed array of struct virtio_net_ff_selector.
> + */
> +struct virtio_net_ff_cap_mask_data {
> +	__u8 count;
> +	__u8 reserved[7];
> +	__u8 selectors[] __counted_by(count);

This looks wrong to me. count is # of selectors (packed entries) not
bytes.




> +};
> +
> +#define VIRTIO_NET_FF_MASK_F_PARTIAL_MASK (1 << 0)
> +
> +#define VIRTIO_NET_FF_ACTION_DROP 1
> +#define VIRTIO_NET_FF_ACTION_RX_VQ 2
> +#define VIRTIO_NET_FF_ACTION_MAX  VIRTIO_NET_FF_ACTION_RX_VQ
> +/**
> + * struct virtio_net_ff_actions - Supported flow actions
> + * @count: number of supported actions in @actions
> + * @reserved: must be set to 0 by the driver and ignored by the device
> + * @actions: array of action identifiers (VIRTIO_NET_FF_ACTION_*)
> + */
> +struct virtio_net_ff_actions {
> +	__u8 count;
> +	__u8 reserved[7];
> +	__u8 actions[] __counted_by(count);


this too.

> +};
> +#endif
> -- 
> 2.50.1


^ permalink raw reply

* Re: [PATCH net-next v2 03/14] net: bridge: mcast: avoid sleeping on bridge-down
From: Ido Schimmel @ 2026-02-08 11:41 UTC (permalink / raw)
  To: Linus Lüssing
  Cc: bridge, netdev, linux-kernel, Nikolay Aleksandrov, Andrew Lunn,
	Simon Horman, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S . Miller, Kuniyuki Iwashima, Stanislav Fomichev,
	Xiao Liang
In-Reply-To: <20260206030123.5430-4-linus.luessing@c0d3.blue>

On Fri, Feb 06, 2026 at 03:52:09AM +0100, Linus Lüssing wrote:
> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
> index a818fdc22da9..d9d1227d5708 100644
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -168,7 +168,9 @@ static int br_dev_open(struct net_device *dev)
>  	netdev_update_features(dev);
>  	netif_start_queue(dev);
>  	br_stp_enable_bridge(br);
> +	spin_lock_bh(&br->multicast_lock);

This wouldn't work when CONFIG_BRIDGE_IGMP_SNOOPING is not set

>  	br_multicast_open(br);
> +	spin_unlock_bh(&br->multicast_lock);
>  
>  	if (br_opt_get(br, BROPT_MULTICAST_ENABLED))
>  		br_multicast_join_snoopers(br);

^ permalink raw reply

* Re: [PATCH net-next v20 07/12] virtio_net: Implement layer 2 ethtool flow rules
From: Michael S. Tsirkin @ 2026-02-08 11:35 UTC (permalink / raw)
  To: Daniel Jurgens
  Cc: netdev, jasowang, pabeni, virtualization, parav, shshitrit,
	yohadt, xuanzhuo, eperezma, jgg, kevin.tian, kuba, andrew+netdev,
	edumazet
In-Reply-To: <20260205224707.16995-8-danielj@nvidia.com>

On Thu, Feb 05, 2026 at 04:47:02PM -0600, Daniel Jurgens wrote:
> Filtering a flow requires a classifier to match the packets, and a rule
> to filter on the matches.
> 
> A classifier consists of one or more selectors. There is one selector
> per header type. A selector must only use fields set in the selector
> capability. If partial matching is supported, the classifier mask for a
> particular field can be a subset of the mask for that field in the
> capability.
> 
> The rule consists of a priority, an action and a key. The key is a byte
> array containing headers corresponding to the selectors in the
> classifier.
> 
> This patch implements ethtool rules for ethernet headers.
> 
> Example:
> $ ethtool -U ens9 flow-type ether dst 08:11:22:33:44:54 action 30
> Added rule with ID 1
> 
> The rule in the example directs received packets with the specified
> destination MAC address to rq 30.
> 
> Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
> Reviewed-by: Parav Pandit <parav@nvidia.com>
> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
> v4:
>     - Fixed double free bug in error flows
>     - Build bug on for classifier struct ordering.
>     - (u8 *) to (void *) casting.
>     - Documentation in UAPI
>     - Answered questions about overflow with no changes.
> v6:
>     - Fix sparse warning "array of flexible structures" Jakub K/Simon H
> v7:
>     - Move for (int i -> for (i hunk from next patch. Paolo Abeni
> 
> v12:
>     - Make key_size u8. MST
>     - Free key in insert_rule when it's successful. MST
> 
> v17:
>     - Fix memory leak if validate_classifier_selector fails. AI
> 
> ---
> ---
>  drivers/net/virtio_net.c           | 464 +++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_net_ff.h |  50 ++++
>  2 files changed, 514 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 340104b22a59..27833ba1abee 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -31,6 +31,7 @@
>  #include <net/ip.h>
>  #include <uapi/linux/virtio_pci.h>
>  #include <uapi/linux/virtio_net_ff.h>
> +#include <linux/xarray.h>
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -286,6 +287,11 @@ static const struct virtnet_stat_desc virtnet_stats_tx_speed_desc_qstat[] = {
>  	VIRTNET_STATS_DESC_TX_QSTAT(speed, ratelimit_packets, hw_drop_ratelimits),
>  };
>  
> +struct virtnet_ethtool_ff {
> +	struct xarray rules;
> +	int    num_rules;
> +};
> +
>  #define VIRTNET_FF_ETHTOOL_GROUP_PRIORITY 1
>  #define VIRTNET_FF_MAX_GROUPS 1
>  
> @@ -295,8 +301,16 @@ struct virtnet_ff {
>  	struct virtio_net_ff_cap_data *ff_caps;
>  	struct virtio_net_ff_cap_mask_data *ff_mask;
>  	struct virtio_net_ff_actions *ff_actions;
> +	struct xarray classifiers;
> +	int num_classifiers;
> +	struct virtnet_ethtool_ff ethtool;
>  };
>  
> +static int virtnet_ethtool_flow_insert(struct virtnet_ff *ff,
> +				       struct ethtool_rx_flow_spec *fs,
> +				       u16 curr_queue_pairs);
> +static int virtnet_ethtool_flow_remove(struct virtnet_ff *ff, int location);
> +
>  #define VIRTNET_Q_TYPE_RX 0
>  #define VIRTNET_Q_TYPE_TX 1
>  #define VIRTNET_Q_TYPE_CQ 2
> @@ -5579,6 +5593,21 @@ static u32 virtnet_get_rx_ring_count(struct net_device *dev)
>  	return vi->curr_queue_pairs;
>  }
>  
> +static int virtnet_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *info)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	switch (info->cmd) {
> +	case ETHTOOL_SRXCLSRLINS:
> +		return virtnet_ethtool_flow_insert(&vi->ff, &info->fs,
> +						   vi->curr_queue_pairs);
> +	case ETHTOOL_SRXCLSRLDEL:
> +		return virtnet_ethtool_flow_remove(&vi->ff, info->fs.location);
> +	}
> +
> +	return -EOPNOTSUPP;
> +}
> +
>  static const struct ethtool_ops virtnet_ethtool_ops = {
>  	.supported_coalesce_params = ETHTOOL_COALESCE_MAX_FRAMES |
>  		ETHTOOL_COALESCE_USECS | ETHTOOL_COALESCE_USE_ADAPTIVE_RX,
> @@ -5605,6 +5634,7 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
>  	.get_rxfh_fields = virtnet_get_hashflow,
>  	.set_rxfh_fields = virtnet_set_hashflow,
>  	.get_rx_ring_count = virtnet_get_rx_ring_count,
> +	.set_rxnfc = virtnet_set_rxnfc,
>  };
>  
>  static void virtnet_get_queue_stats_rx(struct net_device *dev, int i,
> @@ -5702,6 +5732,429 @@ static const struct netdev_stat_ops virtnet_stat_ops = {
>  	.get_base_stats		= virtnet_get_base_stats,
>  };
>  
> +struct virtnet_ethtool_rule {
> +	struct ethtool_rx_flow_spec flow_spec;
> +	u32 classifier_id;
> +};
> +
> +/* The classifier struct must be the last field in this struct */
> +struct virtnet_classifier {
> +	size_t size;
> +	u32 id;
> +	struct virtio_net_resource_obj_ff_classifier classifier;
> +};
> +
> +static_assert(sizeof(struct virtnet_classifier) ==
> +	      ALIGN(offsetofend(struct virtnet_classifier, classifier),
> +		    __alignof__(struct virtnet_classifier)),
> +	      "virtnet_classifier: classifier must be the last member");
> +
> +static bool check_mask_vs_cap(const void *m, const void *c,
> +			      u16 len, bool partial)
> +{
> +	const u8 *mask = m;
> +	const u8 *cap = c;
> +	int i;
> +
> +	for (i = 0; i < len; i++) {
> +		if (partial && ((mask[i] & cap[i]) != mask[i]))
> +			return false;
> +		if (!partial && mask[i] != cap[i])
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static
> +struct virtio_net_ff_selector *get_selector_cap(const struct virtnet_ff *ff,
> +						u8 selector_type)
> +{
> +	struct virtio_net_ff_selector *sel;
> +	void *buf;
> +	int i;
> +
> +	buf = &ff->ff_mask->selectors;
> +	sel = buf;
> +
> +	for (i = 0; i < ff->ff_mask->count; i++) {
> +		if (sel->type == selector_type)
> +			return sel;
> +
> +		buf += sizeof(struct virtio_net_ff_selector) + sel->length;
> +		sel = buf;
> +	}
> +
> +	return NULL;
> +}
> +
> +static bool validate_eth_mask(const struct virtnet_ff *ff,
> +			      const struct virtio_net_ff_selector *sel,
> +			      const struct virtio_net_ff_selector *sel_cap)
> +{
> +	bool partial_mask = !!(sel_cap->flags & VIRTIO_NET_FF_MASK_F_PARTIAL_MASK);
> +	struct ethhdr *cap, *mask;
> +	struct ethhdr zeros = {};
> +
> +	cap = (struct ethhdr *)&sel_cap->mask;
> +	mask = (struct ethhdr *)&sel->mask;
> +
> +	if (memcmp(&zeros.h_dest, mask->h_dest, sizeof(zeros.h_dest)) &&
> +	    !check_mask_vs_cap(mask->h_dest, cap->h_dest,
> +			       sizeof(mask->h_dest), partial_mask))
> +		return false;
> +
> +	if (memcmp(&zeros.h_source, mask->h_source, sizeof(zeros.h_source)) &&
> +	    !check_mask_vs_cap(mask->h_source, cap->h_source,
> +			       sizeof(mask->h_source), partial_mask))
> +		return false;
> +
> +	if (mask->h_proto &&
> +	    !check_mask_vs_cap(&mask->h_proto, &cap->h_proto,
> +			       sizeof(__be16), partial_mask))
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool validate_mask(const struct virtnet_ff *ff,
> +			  const struct virtio_net_ff_selector *sel)
> +{
> +	struct virtio_net_ff_selector *sel_cap = get_selector_cap(ff, sel->type);
> +
> +	if (!sel_cap)
> +		return false;
> +
> +	switch (sel->type) {
> +	case VIRTIO_NET_FF_MASK_TYPE_ETH:
> +		return validate_eth_mask(ff, sel, sel_cap);
> +	}
> +
> +	return false;
> +}
> +
> +static int setup_classifier(struct virtnet_ff *ff, struct virtnet_classifier *c)
> +{
> +	int err;
> +
> +	err = xa_alloc(&ff->classifiers, &c->id, c,
> +		       XA_LIMIT(0, le32_to_cpu(ff->ff_caps->classifiers_limit) - 1),
> +		       GFP_KERNEL);
> +	if (err)
> +		return err;
> +
> +	err = virtio_admin_obj_create(ff->vdev,
> +				      VIRTIO_NET_RESOURCE_OBJ_FF_CLASSIFIER,
> +				      c->id,
> +				      VIRTIO_ADMIN_GROUP_TYPE_SELF,
> +				      0,
> +				      &c->classifier,
> +				      c->size);
> +	if (err)
> +		goto err_xarray;
> +
> +	return 0;
> +
> +err_xarray:
> +	xa_erase(&ff->classifiers, c->id);
> +
> +	return err;
> +}
> +
> +static void destroy_classifier(struct virtnet_ff *ff,
> +			       u32 classifier_id)
> +{
> +	struct virtnet_classifier *c;
> +
> +	c = xa_load(&ff->classifiers, classifier_id);
> +	if (c) {
> +		virtio_admin_obj_destroy(ff->vdev,
> +					 VIRTIO_NET_RESOURCE_OBJ_FF_CLASSIFIER,
> +					 c->id,
> +					 VIRTIO_ADMIN_GROUP_TYPE_SELF,
> +					 0);
> +
> +		xa_erase(&ff->classifiers, c->id);
> +		kfree(c);
> +	}
> +}
> +
> +static void destroy_ethtool_rule(struct virtnet_ff *ff,
> +				 struct virtnet_ethtool_rule *eth_rule)
> +{
> +	ff->ethtool.num_rules--;
> +
> +	virtio_admin_obj_destroy(ff->vdev,
> +				 VIRTIO_NET_RESOURCE_OBJ_FF_RULE,
> +				 eth_rule->flow_spec.location,
> +				 VIRTIO_ADMIN_GROUP_TYPE_SELF,
> +				 0);
> +
> +	xa_erase(&ff->ethtool.rules, eth_rule->flow_spec.location);
> +	destroy_classifier(ff, eth_rule->classifier_id);
> +	kfree(eth_rule);
> +}
> +
> +static int insert_rule(struct virtnet_ff *ff,
> +		       struct virtnet_ethtool_rule *eth_rule,
> +		       u32 classifier_id,
> +		       const u8 *key,
> +		       u8 key_size)
> +{
> +	struct ethtool_rx_flow_spec *fs = &eth_rule->flow_spec;
> +	struct virtio_net_resource_obj_ff_rule *ff_rule;
> +	int err;
> +
> +	ff_rule = kzalloc(sizeof(*ff_rule) + key_size, GFP_KERNEL);
> +	if (!ff_rule)
> +		return -ENOMEM;
> +
> +	/* Intentionally leave the priority as 0. All rules have the same
> +	 * priority.
> +	 */
> +	ff_rule->group_id = cpu_to_le32(VIRTNET_FF_ETHTOOL_GROUP_PRIORITY);
> +	ff_rule->classifier_id = cpu_to_le32(classifier_id);
> +	ff_rule->key_length = key_size;
> +	ff_rule->action = fs->ring_cookie == RX_CLS_FLOW_DISC ?
> +					     VIRTIO_NET_FF_ACTION_DROP :
> +					     VIRTIO_NET_FF_ACTION_RX_VQ;

btw, the driver does not validate that these actions are supported — or did I miss
the check? The spec does not say it must, but it also does not
say how the device will behave if they are not. Better not to take risks.


> +	ff_rule->vq_index = fs->ring_cookie != RX_CLS_FLOW_DISC ?
> +					       cpu_to_le16(fs->ring_cookie) : 0;


So here ring_cookie is a vq index? vq index is what matches the spec. but below ...


> +	memcpy(&ff_rule->keys, key, key_size);
> +
> +	err = virtio_admin_obj_create(ff->vdev,
> +				      VIRTIO_NET_RESOURCE_OBJ_FF_RULE,
> +				      fs->location,
> +				      VIRTIO_ADMIN_GROUP_TYPE_SELF,
> +				      0,
> +				      ff_rule,
> +				      sizeof(*ff_rule) + key_size);
> +	if (err)
> +		goto err_ff_rule;
> +
> +	eth_rule->classifier_id = classifier_id;
> +	ff->ethtool.num_rules++;
> +	kfree(ff_rule);
> +	kfree(key);
> +
> +	return 0;
> +
> +err_ff_rule:
> +	kfree(ff_rule);
> +
> +	return err;
> +}
> +
> +static u32 flow_type_mask(u32 flow_type)
> +{
> +	return flow_type & ~(FLOW_EXT | FLOW_MAC_EXT | FLOW_RSS);
> +}
> +
> +static bool supported_flow_type(const struct ethtool_rx_flow_spec *fs)
> +{
> +	switch (fs->flow_type) {
> +	case ETHER_FLOW:
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +static int validate_flow_input(struct virtnet_ff *ff,
> +			       const struct ethtool_rx_flow_spec *fs,
> +			       u16 curr_queue_pairs)
> +{
> +	/* Force users to use RX_CLS_LOC_ANY - don't allow specific locations */
> +	if (fs->location != RX_CLS_LOC_ANY)
> +		return -EOPNOTSUPP;
> +
> +	if (fs->ring_cookie != RX_CLS_FLOW_DISC &&
> +	    fs->ring_cookie >= curr_queue_pairs)
> +		return -EINVAL;


here ring cookie seems to be a queue pair index?

> +
> +	if (fs->flow_type != flow_type_mask(fs->flow_type))
> +		return -EOPNOTSUPP;
> +
> +	if (!supported_flow_type(fs))
> +		return -EOPNOTSUPP;
> +
> +	return 0;
> +}
> +
> +static void calculate_flow_sizes(struct ethtool_rx_flow_spec *fs,
> +				 u8 *key_size, size_t *classifier_size,
> +				 int *num_hdrs)
> +{
> +	*num_hdrs = 1;
> +	*key_size = sizeof(struct ethhdr);
> +	/*
> +	 * The classifier size is the size of the classifier header, a selector
> +	 * header for each type of header in the match criteria, and each header
> +	 * providing the mask for matching against.
> +	 */
> +	*classifier_size = *key_size +
> +			   sizeof(struct virtio_net_resource_obj_ff_classifier) +
> +			   sizeof(struct virtio_net_ff_selector) * (*num_hdrs);
> +}
> +
> +static void setup_eth_hdr_key_mask(struct virtio_net_ff_selector *selector,
> +				   u8 *key,
> +				   const struct ethtool_rx_flow_spec *fs)
> +{
> +	struct ethhdr *eth_m = (struct ethhdr *)&selector->mask;
> +	struct ethhdr *eth_k = (struct ethhdr *)key;
> +
> +	selector->type = VIRTIO_NET_FF_MASK_TYPE_ETH;
> +	selector->length = sizeof(struct ethhdr);
> +
> +	memcpy(eth_m, &fs->m_u.ether_spec, sizeof(*eth_m));
> +	memcpy(eth_k, &fs->h_u.ether_spec, sizeof(*eth_k));
> +}
> +
> +static int
> +validate_classifier_selectors(struct virtnet_ff *ff,
> +			      struct virtio_net_resource_obj_ff_classifier *classifier,
> +			      int num_hdrs)
> +{
> +	struct virtio_net_ff_selector *selector = (void *)classifier->selectors;
> +	int i;
> +
> +	for (i = 0; i < num_hdrs; i++) {
> +		if (!validate_mask(ff, selector))
> +			return -EINVAL;
> +
> +		selector = (((void *)selector) + sizeof(*selector) +
> +					selector->length);
> +	}
> +
> +	return 0;
> +}
> +
> +static int build_and_insert(struct virtnet_ff *ff,
> +			    struct virtnet_ethtool_rule *eth_rule)
> +{
> +	struct virtio_net_resource_obj_ff_classifier *classifier;
> +	struct ethtool_rx_flow_spec *fs = &eth_rule->flow_spec;
> +	struct virtio_net_ff_selector *selector;
> +	struct virtnet_classifier *c;
> +	size_t classifier_size;
> +	int num_hdrs;
> +	u8 key_size;
> +	u8 *key;
> +	int err;
> +
> +	calculate_flow_sizes(fs, &key_size, &classifier_size, &num_hdrs);
> +
> +	key = kzalloc(key_size, GFP_KERNEL);
> +	if (!key)
> +		return -ENOMEM;
> +
> +	/*
> +	 * virtio_net_ff_obj_ff_classifier is already included in the
> +	 * classifier_size.
> +	 */
> +	c = kzalloc(classifier_size +
> +		    sizeof(struct virtnet_classifier) -
> +		    sizeof(struct virtio_net_resource_obj_ff_classifier),
> +		    GFP_KERNEL);
> +	if (!c) {
> +		kfree(key);
> +		return -ENOMEM;
> +	}
> +
> +	c->size = classifier_size;
> +	classifier = &c->classifier;
> +	classifier->count = num_hdrs;
> +	selector = (void *)&classifier->selectors[0];
> +
> +	setup_eth_hdr_key_mask(selector, key, fs);
> +
> +	err = validate_classifier_selectors(ff, classifier, num_hdrs);
> +	if (err)
> +		goto err_classifier;
> +
> +	err = setup_classifier(ff, c);
> +	if (err)
> +		goto err_classifier;
> +
> +	err = insert_rule(ff, eth_rule, c->id, key, key_size);
> +	if (err) {
> +		/* destroy_classifier will free the classifier */
> +		destroy_classifier(ff, c->id);
> +		goto err_key;
> +	}
> +
> +	return 0;
> +
> +err_classifier:
> +	kfree(c);
> +err_key:
> +	kfree(key);
> +
> +	return err;
> +}
> +
> +static int virtnet_ethtool_flow_insert(struct virtnet_ff *ff,
> +				       struct ethtool_rx_flow_spec *fs,
> +				       u16 curr_queue_pairs)
> +{
> +	struct virtnet_ethtool_rule *eth_rule;
> +	int err;
> +
> +	if (!ff->ff_supported)
> +		return -EOPNOTSUPP;
> +
> +	err = validate_flow_input(ff, fs, curr_queue_pairs);
> +	if (err)
> +		return err;
> +
> +	eth_rule = kzalloc(sizeof(*eth_rule), GFP_KERNEL);
> +	if (!eth_rule)
> +		return -ENOMEM;
> +
> +	err = xa_alloc(&ff->ethtool.rules, &fs->location, eth_rule,
> +		       XA_LIMIT(0, le32_to_cpu(ff->ff_caps->rules_limit) - 1),
> +		       GFP_KERNEL);
> +	if (err)
> +		goto err_rule;
> +
> +	eth_rule->flow_spec = *fs;
> +
> +	err = build_and_insert(ff, eth_rule);
> +	if (err)
> +		goto err_xa;
> +
> +	return err;
> +
> +err_xa:
> +	xa_erase(&ff->ethtool.rules, eth_rule->flow_spec.location);
> +
> +err_rule:
> +	fs->location = RX_CLS_LOC_ANY;
> +	kfree(eth_rule);
> +
> +	return err;
> +}
> +
> +static int virtnet_ethtool_flow_remove(struct virtnet_ff *ff, int location)
> +{
> +	struct virtnet_ethtool_rule *eth_rule;
> +	int err = 0;
> +
> +	if (!ff->ff_supported)
> +		return -EOPNOTSUPP;
> +
> +	eth_rule = xa_load(&ff->ethtool.rules, location);
> +	if (!eth_rule) {
> +		err = -ENOENT;
> +		goto out;
> +	}
> +
> +	destroy_ethtool_rule(ff, eth_rule);
> +out:
> +	return err;
> +}
> +
>  static size_t get_mask_size(u16 type)
>  {
>  	switch (type) {
> @@ -5875,6 +6328,8 @@ static int virtnet_ff_init(struct virtnet_ff *ff, struct virtio_device *vdev)
>  	if (err)
>  		goto err_ff_action;
>  
> +	xa_init_flags(&ff->classifiers, XA_FLAGS_ALLOC);
> +	xa_init_flags(&ff->ethtool.rules, XA_FLAGS_ALLOC);
>  	ff->vdev = vdev;
>  	ff->ff_supported = true;
>  
> @@ -5899,9 +6354,18 @@ static int virtnet_ff_init(struct virtnet_ff *ff, struct virtio_device *vdev)
>  
>  static void virtnet_ff_cleanup(struct virtnet_ff *ff)
>  {
> +	struct virtnet_ethtool_rule *eth_rule;
> +	unsigned long i;
> +
>  	if (!ff->ff_supported)
>  		return;
>  
> +	xa_for_each(&ff->ethtool.rules, i, eth_rule)
> +		destroy_ethtool_rule(ff, eth_rule);
> +
> +	xa_destroy(&ff->ethtool.rules);
> +	xa_destroy(&ff->classifiers);
> +
>  	virtio_admin_obj_destroy(ff->vdev,
>  				 VIRTIO_NET_RESOURCE_OBJ_FF_GROUP,
>  				 VIRTNET_FF_ETHTOOL_GROUP_PRIORITY,
> diff --git a/include/uapi/linux/virtio_net_ff.h b/include/uapi/linux/virtio_net_ff.h
> index c7d01e3edc80..aa0619f9c63b 100644
> --- a/include/uapi/linux/virtio_net_ff.h
> +++ b/include/uapi/linux/virtio_net_ff.h
> @@ -13,6 +13,8 @@
>  #define VIRTIO_NET_FF_ACTION_CAP 0x802
>  
>  #define VIRTIO_NET_RESOURCE_OBJ_FF_GROUP 0x0200
> +#define VIRTIO_NET_RESOURCE_OBJ_FF_CLASSIFIER 0x0201
> +#define VIRTIO_NET_RESOURCE_OBJ_FF_RULE 0x0202
>  
>  /**
>   * struct virtio_net_ff_cap_data - Flow filter resource capability limits
> @@ -103,4 +105,52 @@ struct virtio_net_resource_obj_ff_group {
>  	__le16 group_priority;
>  };
>  
> +/**
> + * struct virtio_net_resource_obj_ff_classifier - Flow filter classifier object
> + * @count: number of selector entries in @selectors
> + * @reserved: must be set to 0 by the driver and ignored by the device
> + * @selectors: array of selector descriptors that define match masks
> + *
> + * Payload for the VIRTIO_NET_RESOURCE_OBJ_FF_CLASSIFIER administrative object.
> + * Each selector describes a header mask used to match packets
> + * (see struct virtio_net_ff_selector). Selectors appear in the order they are
> + * to be applied.
> + */
> +struct virtio_net_resource_obj_ff_classifier {
> +	__u8 count;
> +	__u8 reserved[7];
> +	__u8 selectors[];
> +};
> +
> +/**
> + * struct virtio_net_resource_obj_ff_rule - Flow filter rule object
> + * @group_id: identifier of the target flow filter group
> + * @classifier_id: identifier of the classifier referenced by this rule
> + * @rule_priority: relative priority of this rule within the group
> + * @key_length: number of bytes in @keys
> + * @action: action to perform, one of VIRTIO_NET_FF_ACTION_*
> + * @reserved: must be set to 0 by the driver and ignored by the device
> + * @vq_index: RX virtqueue index for VIRTIO_NET_FF_ACTION_RX_VQ, 0 otherwise
> + * @reserved1: must be set to 0 by the driver and ignored by the device
> + * @keys: concatenated key bytes matching the classifier's selectors order
> + *
> + * Payload for the VIRTIO_NET_RESOURCE_OBJ_FF_RULE administrative object.
> + * @group_id and @classifier_id refer to previously created objects of types
> + * VIRTIO_NET_RESOURCE_OBJ_FF_GROUP and VIRTIO_NET_RESOURCE_OBJ_FF_CLASSIFIER
> + * respectively. The key bytes are compared against packet headers using the
> + * masks provided by the classifier's selectors. Multi-byte fields are
> + * little-endian.
> + */
> +struct virtio_net_resource_obj_ff_rule {
> +	__le32 group_id;
> +	__le32 classifier_id;
> +	__u8 rule_priority;
> +	__u8 key_length; /* length of key in bytes */
> +	__u8 action;
> +	__u8 reserved;
> +	__le16 vq_index;
> +	__u8 reserved1[2];
> +	__u8 keys[];
> +};
> +
>  #endif
> -- 
> 2.50.1


^ permalink raw reply

* [PATCH net] net: flow_offload: protect driver_block_list in flow_block_cb_setup_simple()
From: Shigeru Yoshida @ 2026-02-08 11:00 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Shigeru Yoshida
  Cc: syzbot+5a66db916cdde0dbcc1c, netdev, linux-kernel,
	netfilter-devel, coreteam

syzbot reported a list_del corruption in flow_block_cb_setup_simple(). [0]

flow_block_cb_setup_simple() accesses the driver_block_list (e.g.,
netdevsim's nsim_block_cb_list) without any synchronization. The
nftables offload path calls into this function via ndo_setup_tc while
holding the per-netns commit_mutex, but this mutex does not prevent
concurrent access from tasks in different network namespaces that
share the same driver_block_list, leading to list corruption:

- Task A (FLOW_BLOCK_BIND) calls list_add_tail() to insert a new
  flow_block_cb into driver_block_list.

- Task B (FLOW_BLOCK_UNBIND) concurrently calls list_del() on another
  flow_block_cb from the same list.

- The concurrent modifications corrupt the list pointers.

Fix this by adding a static mutex (flow_block_cb_list_lock) that
protects all driver_block_list operations within
flow_block_cb_setup_simple(). Also add a flow_block_cb_remove_driver()
helper for external callers that need to remove a block_cb from the
driver list under the same lock, and convert nft_indr_block_cleanup()
to use it.

[0]:
list_del corruption. prev->next should be ffff888028878200, but was ffffffff8e940fc0. (prev=ffffffff8e940fc0)
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:64!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 6308 Comm: syz.3.231 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
RIP: 0010:__list_del_entry_valid_or_report+0x15a/0x190 lib/list_debug.c:62
[...]
Call Trace:
 <TASK>
 __list_del_entry_valid include/linux/list.h:124 [inline]
 __list_del_entry include/linux/list.h:215 [inline]
 list_del include/linux/list.h:229 [inline]
 flow_block_cb_setup_simple+0x62d/0x740 net/core/flow_offload.c:369
 nft_block_offload_cmd net/netfilter/nf_tables_offload.c:397 [inline]
 nft_chain_offload_cmd+0x293/0x660 net/netfilter/nf_tables_offload.c:451
 nft_flow_block_chain net/netfilter/nf_tables_offload.c:471 [inline]
 nft_flow_offload_chain net/netfilter/nf_tables_offload.c:513 [inline]
 nft_flow_rule_offload_commit+0x40d/0x1b60 net/netfilter/nf_tables_offload.c:592
 nf_tables_commit+0x675/0x8710 net/netfilter/nf_tables_api.c:10925
 nfnetlink_rcv_batch net/netfilter/nfnetlink.c:576 [inline]
 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:649 [inline]
 nfnetlink_rcv+0x1ac9/0x2590 net/netfilter/nfnetlink.c:667
 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline]
 netlink_unicast+0x82c/0x9e0 net/netlink/af_netlink.c:1346
 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896
 sock_sendmsg_nosec net/socket.c:727 [inline]
 __sock_sendmsg+0x219/0x270 net/socket.c:742
 ____sys_sendmsg+0x505/0x830 net/socket.c:2630
 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684
 __sys_sendmsg net/socket.c:2716 [inline]
 __do_sys_sendmsg net/socket.c:2721 [inline]
 __se_sys_sendmsg net/socket.c:2719 [inline]
 __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2719
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 955bcb6ea0df ("drivers: net: use flow block API")
Reported-by: syzbot+5a66db916cdde0dbcc1c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5a66db916cdde0dbcc1c
Tested-by: syzbot+5a66db916cdde0dbcc1c@syzkaller.appspotmail.com
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
---
 include/net/flow_offload.h        |  2 ++
 net/core/flow_offload.c           | 41 ++++++++++++++++++++++++-------
 net/netfilter/nf_tables_offload.c |  2 +-
 3 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 596ab9791e4d..ff6d2bcb2cca 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -673,6 +673,8 @@ int flow_block_cb_setup_simple(struct flow_block_offload *f,
 			       flow_setup_cb_t *cb,
 			       void *cb_ident, void *cb_priv, bool ingress_only);
 
+void flow_block_cb_remove_driver(struct flow_block_cb *block_cb);
+
 enum flow_cls_command {
 	FLOW_CLS_REPLACE,
 	FLOW_CLS_DESTROY,
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index bc5169482710..137a44af5e1c 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -334,6 +334,8 @@ bool flow_block_cb_is_busy(flow_setup_cb_t *cb, void *cb_ident,
 }
 EXPORT_SYMBOL(flow_block_cb_is_busy);
 
+static DEFINE_MUTEX(flow_block_cb_list_lock);
+
 int flow_block_cb_setup_simple(struct flow_block_offload *f,
 			       struct list_head *driver_block_list,
 			       flow_setup_cb_t *cb,
@@ -341,6 +343,7 @@ int flow_block_cb_setup_simple(struct flow_block_offload *f,
 			       bool ingress_only)
 {
 	struct flow_block_cb *block_cb;
+	int err = 0;
 
 	if (ingress_only &&
 	    f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
@@ -348,32 +351,52 @@ int flow_block_cb_setup_simple(struct flow_block_offload *f,
 
 	f->driver_block_list = driver_block_list;
 
+	mutex_lock(&flow_block_cb_list_lock);
+
 	switch (f->command) {
 	case FLOW_BLOCK_BIND:
-		if (flow_block_cb_is_busy(cb, cb_ident, driver_block_list))
-			return -EBUSY;
+		if (flow_block_cb_is_busy(cb, cb_ident, driver_block_list)) {
+			err = -EBUSY;
+			break;
+		}
 
 		block_cb = flow_block_cb_alloc(cb, cb_ident, cb_priv, NULL);
-		if (IS_ERR(block_cb))
-			return PTR_ERR(block_cb);
+		if (IS_ERR(block_cb)) {
+			err = PTR_ERR(block_cb);
+			break;
+		}
 
 		flow_block_cb_add(block_cb, f);
 		list_add_tail(&block_cb->driver_list, driver_block_list);
-		return 0;
+		break;
 	case FLOW_BLOCK_UNBIND:
 		block_cb = flow_block_cb_lookup(f->block, cb, cb_ident);
-		if (!block_cb)
-			return -ENOENT;
+		if (!block_cb) {
+			err = -ENOENT;
+			break;
+		}
 
 		flow_block_cb_remove(block_cb, f);
 		list_del(&block_cb->driver_list);
-		return 0;
+		break;
 	default:
-		return -EOPNOTSUPP;
+		err = -EOPNOTSUPP;
+		break;
 	}
+
+	mutex_unlock(&flow_block_cb_list_lock);
+	return err;
 }
 EXPORT_SYMBOL(flow_block_cb_setup_simple);
 
+void flow_block_cb_remove_driver(struct flow_block_cb *block_cb)
+{
+	mutex_lock(&flow_block_cb_list_lock);
+	list_del(&block_cb->driver_list);
+	mutex_unlock(&flow_block_cb_list_lock);
+}
+EXPORT_SYMBOL(flow_block_cb_remove_driver);
+
 static DEFINE_MUTEX(flow_indr_block_lock);
 static LIST_HEAD(flow_block_indr_list);
 static LIST_HEAD(flow_block_indr_dev_list);
diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
index fd30e205de84..d60838bceafb 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -414,7 +414,7 @@ static void nft_indr_block_cleanup(struct flow_block_cb *block_cb)
 				    basechain, &extack);
 	nft_net = nft_pernet(net);
 	mutex_lock(&nft_net->commit_mutex);
-	list_del(&block_cb->driver_list);
+	flow_block_cb_remove_driver(block_cb);
 	list_move(&block_cb->list, &bo.cb_list);
 	nft_flow_offload_unbind(&bo, basechain);
 	mutex_unlock(&nft_net->commit_mutex);
-- 
2.52.0


^ permalink raw reply related


This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox