Netdev List
* [PATCH net 3/4] vlan: defer real device state propagation to netdev_work
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, Jakub Kicinski,
	syzbot+09da62a8b78959ceb8bb, syzbot+cb67c392b0b8f0fd0fc1,
	syzbot+9bb8bd77f3966641f298
In-Reply-To: <20260624182018.2445732-1-kuba@kernel.org>

vlan_device_event() generates nested UP/DOWN, MTU and feature
change events. It executes an event for the VLAN device directly
from the notifier - while the locks of the lower device are held.

This causes deadlocks, for example:

  bond    (3) bond_update_speed_duplex(vlan)
    |           ^                v
  vlan    (2) UP(vlan)    (4) vlan_ethtool_get_link_ksettings()
    |           ^                v
  dummy   (1) UP(dummy)   (5) __ethtool_get_link_ksettings()

The dummy device is ops locked, vlan creates a nested event (2),
then bond wants to ask vlan for link state (3). bond uses the
"I'm already holding the instance lock" flavor of API. But in
this case the lock held refers to vlan itself. We hit vlan's
link settings trampoline (4) and call __ethtool_get_link_ksettings()
which tries to lock dummy. Deadlock. There's no clean way for us
to tell the vlan_ethtool_get_link_ksettings() that the caller
is already in lower device's critical section.

Defer the propagation to the per-netdev work facility instead:
the notifier only schedules netdev_work_sched(vlandev, VLAN_WORK_*),
and ndo_work (vlan_dev_work) applies the change later. Hopefully
nobody expects the VLAN state changes to be instantaneous.
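
As a rough illustration of what the deferred link-state sync decides,
the VLAN_WORK_LINK_STATE branch of vlan_dev_work() can be reduced to
a pure function. This is a userspace sketch, not kernel code: the name
toy_vlan_link_sync() and its boolean arguments are made up for the
example and stand in for the IFF_UP and VLAN_FLAG_LOOSE_BINDING checks.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the link-state branch in vlan_dev_work().
 * Returns whether the vlan device should be up after the sync.
 */
bool toy_vlan_link_sync(bool real_up, bool vlan_up, bool loose)
{
	if (real_up) {
		/* Real device is up: open the vlan unless loosely bound. */
		if (!vlan_up && !loose)
			return true;
		return vlan_up;
	}

	/* Real device is down: close the vlan unless loosely bound. */
	if (vlan_up && !loose)
		return false;
	return vlan_up;
}
```

In this model a loosely bound vlan never changes state, and any other
vlan simply ends up following the real device.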

If someone does expect the changes to be instantaneous we will
have to do the same thing Stan did for rx_mode and "strategically"
place sync calls, to make sure such delayed works are executed
after we drop the ops lock but before we drop rtnl_lock.

Stan suggests that if we need that down the line we may
consider reshaping the mechanism into "async notifications".
AFAICT only vlan does this sort of netdev open chaining,
so as a first try I think that sticking the complexity into
the vlan code makes sense.

One corner case is that we need to cancel the event if the user
explicitly changes the state before the work has run. Consider
the following operations with vlan0 on top of dummy0:

  ip link set dev dummy0 up    # queues work to up vlan0
  ip link set dev vlan0 down   # user explicitly downs the vlan
  ndo_work                     # acts on the stale event
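
The sequence above can be modeled in userspace to show why the
pending bit has to be cleared on an explicit state change. This is
only a sketch: struct toy_vlan, the toy_* helpers and the cancel_stale
switch are invented for the example, and the real device is assumed
to stay up for the whole scenario.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_WORK_LINK_STATE (1UL << 0) /* stands in for VLAN_WORK_LINK_STATE */

struct toy_vlan {
	bool up;               /* modeled IFF_UP of the vlan device */
	unsigned long pending; /* modeled deferred events */
};

/* "ip link set dev dummy0 up": only queues the link-state sync. */
void toy_real_dev_up(struct toy_vlan *v)
{
	v->pending |= TOY_WORK_LINK_STATE;
}

/* "ip link set dev vlan0 down": optionally drops the stale event. */
void toy_user_set_down(struct toy_vlan *v, bool cancel_stale)
{
	if (cancel_stale)
		v->pending &= ~TOY_WORK_LINK_STATE;
	v->up = false;
}

/* Deferred work: claims pending events and syncs with the (up) real dev. */
void toy_run_work(struct toy_vlan *v)
{
	unsigned long events = v->pending;

	v->pending = 0;
	if (events & TOY_WORK_LINK_STATE)
		v->up = true;
}
```

Without the cancel, the work run would bring the vlan back up behind
the user's back; with it, the explicit "down" wins.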

Reported-by: syzbot+09da62a8b78959ceb8bb@syzkaller.appspotmail.com
Reported-by: syzbot+cb67c392b0b8f0fd0fc1@syzkaller.appspotmail.com
Reported-by: syzbot+9bb8bd77f3966641f298@syzkaller.appspotmail.com
Fixes: 9f275c2e9020 ("net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/networking/netdevices.rst |  2 +
 net/8021q/vlan.h                        | 11 ++++
 net/8021q/vlan.c                        | 76 +++----------------------
 net/8021q/vlan_dev.c                    | 60 +++++++++++++++++++
 net/core/dev.c                          |  1 +
 5 files changed, 82 insertions(+), 68 deletions(-)

diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index fde601acd1d2..d2a238f8cc8b 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -433,6 +433,8 @@ exceptions) notifiers run under the instance lock. Please extend this
 documentation whenever you make explicit assumption about lock being held
 from a notifier.
 
+Drivers **must not** generate nested notifications of the ops-locked types.
+
 NETDEV_INTERNAL symbol namespace
 ================================
 
diff --git a/net/8021q/vlan.h b/net/8021q/vlan.h
index c7ffe591d593..c41caaf94095 100644
--- a/net/8021q/vlan.h
+++ b/net/8021q/vlan.h
@@ -125,6 +125,17 @@ static inline netdev_features_t vlan_tnl_features(struct net_device *real_dev)
 int vlan_filter_push_vids(struct vlan_info *vlan_info, __be16 proto);
 void vlan_filter_drop_vids(struct vlan_info *vlan_info, __be16 proto);
 
+/* netdev_work events propagated from the real device, see vlan_dev_work(). */
+enum {
+	VLAN_WORK_LINK_STATE	= BIT(0), /* sync up/down with real_dev */
+	VLAN_WORK_MTU		= BIT(1), /* clamp mtu to real_dev's */
+	VLAN_WORK_FEATURES	= BIT(2), /* re-inherit real_dev features */
+};
+
+void vlan_stacked_transfer_operstate(const struct net_device *rootdev,
+				     struct net_device *dev,
+				     struct vlan_dev_priv *vlan);
+
 /* found in vlan_dev.c */
 void vlan_dev_set_ingress_priority(const struct net_device *dev,
 				   u32 skb_prio, u16 vlan_prio);
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 2b74ed56eb16..2d2efb877975 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -77,9 +77,9 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 	return 0;
 }
 
-static void vlan_stacked_transfer_operstate(const struct net_device *rootdev,
-					    struct net_device *dev,
-					    struct vlan_dev_priv *vlan)
+void vlan_stacked_transfer_operstate(const struct net_device *rootdev,
+				     struct net_device *dev,
+				     struct vlan_dev_priv *vlan)
 {
 	if (!(vlan->flags & VLAN_FLAG_BRIDGE_BINDING))
 		netif_stacked_transfer_operstate(rootdev, dev);
@@ -316,29 +316,6 @@ static void vlan_sync_address(struct net_device *dev,
 	ether_addr_copy(vlan->real_dev_addr, dev->dev_addr);
 }
 
-static void vlan_transfer_features(struct net_device *dev,
-				   struct net_device *vlandev)
-{
-	struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev);
-
-	netif_inherit_tso_max(vlandev, dev);
-
-	if (vlan_hw_offload_capable(dev->features, vlan->vlan_proto))
-		vlandev->hard_header_len = dev->hard_header_len;
-	else
-		vlandev->hard_header_len = dev->hard_header_len + VLAN_HLEN;
-
-#if IS_ENABLED(CONFIG_FCOE)
-	vlandev->fcoe_ddp_xid = dev->fcoe_ddp_xid;
-#endif
-
-	vlandev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
-	vlandev->priv_flags |= (vlan->real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
-	vlandev->hw_enc_features = vlan_tnl_features(vlan->real_dev);
-
-	netdev_update_features(vlandev);
-}
-
 static int __vlan_device_event(struct net_device *dev, unsigned long event)
 {
 	int err = 0;
@@ -391,13 +368,11 @@ static void vlan_vid0_del(struct net_device *dev)
 static int vlan_device_event(struct notifier_block *unused, unsigned long event,
 			     void *ptr)
 {
-	struct netlink_ext_ack *extack = netdev_notifier_info_to_extack(ptr);
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct vlan_group *grp;
 	struct vlan_info *vlan_info;
 	int i, flgs;
 	struct net_device *vlandev;
-	struct vlan_dev_priv *vlan;
 	bool last = false;
 	LIST_HEAD(list);
 	int err;
@@ -447,54 +422,19 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event,
 			if (vlandev->mtu <= dev->mtu)
 				continue;
 
-			dev_set_mtu(vlandev, dev->mtu);
+			netdev_work_sched(vlandev, VLAN_WORK_MTU);
 		}
 		break;
 
 	case NETDEV_FEAT_CHANGE:
-		/* Propagate device features to underlying device */
 		vlan_group_for_each_dev(grp, i, vlandev)
-			vlan_transfer_features(dev, vlandev);
+			netdev_work_sched(vlandev, VLAN_WORK_FEATURES);
 		break;
 
-	case NETDEV_DOWN: {
-		struct net_device *tmp;
-		LIST_HEAD(close_list);
-
-		/* Put all VLANs for this dev in the down state too.  */
-		vlan_group_for_each_dev(grp, i, vlandev) {
-			flgs = vlandev->flags;
-			if (!(flgs & IFF_UP))
-				continue;
-
-			vlan = vlan_dev_priv(vlandev);
-			if (!(vlan->flags & VLAN_FLAG_LOOSE_BINDING))
-				list_add(&vlandev->close_list, &close_list);
-		}
-
-		netif_close_many(&close_list, false);
-
-		list_for_each_entry_safe(vlandev, tmp, &close_list, close_list) {
-			vlan_stacked_transfer_operstate(dev, vlandev,
-							vlan_dev_priv(vlandev));
-			list_del_init(&vlandev->close_list);
-		}
-		list_del(&close_list);
-		break;
-	}
+	case NETDEV_DOWN:
 	case NETDEV_UP:
-		/* Put all VLANs for this dev in the up state too.  */
-		vlan_group_for_each_dev(grp, i, vlandev) {
-			flgs = netif_get_flags(vlandev);
-			if (flgs & IFF_UP)
-				continue;
-
-			vlan = vlan_dev_priv(vlandev);
-			if (!(vlan->flags & VLAN_FLAG_LOOSE_BINDING))
-				dev_change_flags(vlandev, flgs | IFF_UP,
-						 extack);
-			vlan_stacked_transfer_operstate(dev, vlandev, vlan);
-		}
+		vlan_group_for_each_dev(grp, i, vlandev)
+			netdev_work_sched(vlandev, VLAN_WORK_LINK_STATE);
 		break;
 
 	case NETDEV_UNREGISTER:
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 7aa3af8b10ea..ec2569b3f8da 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -270,6 +270,9 @@ static int vlan_dev_open(struct net_device *dev)
 	    !(vlan->flags & VLAN_FLAG_LOOSE_BINDING))
 		return -ENETDOWN;
 
+	/* The explicit open supersedes any deferred link-state sync */
+	netdev_work_cancel(dev, VLAN_WORK_LINK_STATE);
+
 	if (!ether_addr_equal(dev->dev_addr, real_dev->dev_addr) &&
 	    !vlan_dev_inherit_address(dev, real_dev)) {
 		err = dev_uc_add(real_dev, dev->dev_addr);
@@ -300,6 +303,9 @@ static int vlan_dev_stop(struct net_device *dev)
 	struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
 	struct net_device *real_dev = vlan->real_dev;
 
+	/* The explicit close supersedes any deferred link-state sync */
+	netdev_work_cancel(dev, VLAN_WORK_LINK_STATE);
+
 	dev_mc_unsync(real_dev, dev);
 	dev_uc_unsync(real_dev, dev);
 
@@ -1016,6 +1022,59 @@ static const struct ethtool_ops vlan_ethtool_ops = {
 	.get_ts_info		= vlan_ethtool_get_ts_info,
 };
 
+static void vlan_transfer_features(struct net_device *dev,
+				   struct net_device *vlandev)
+{
+	struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev);
+
+	netif_inherit_tso_max(vlandev, dev);
+
+	if (vlan_hw_offload_capable(dev->features, vlan->vlan_proto))
+		vlandev->hard_header_len = dev->hard_header_len;
+	else
+		vlandev->hard_header_len = dev->hard_header_len + VLAN_HLEN;
+
+#if IS_ENABLED(CONFIG_FCOE)
+	vlandev->fcoe_ddp_xid = dev->fcoe_ddp_xid;
+#endif
+
+	vlandev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+	vlandev->priv_flags |= (vlan->real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
+	vlandev->hw_enc_features = vlan_tnl_features(vlan->real_dev);
+
+	netdev_update_features(vlandev);
+}
+
+static void vlan_dev_work(struct net_device *vlandev, unsigned long events)
+{
+	struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev);
+	struct net_device *real_dev = vlan->real_dev;
+	bool loose = vlan->flags & VLAN_FLAG_LOOSE_BINDING;
+	unsigned int flgs;
+
+	if (events & VLAN_WORK_LINK_STATE) {
+		flgs = netif_get_flags(vlandev);
+		if (real_dev->flags & IFF_UP) {
+			if (!(flgs & IFF_UP)) {
+				if (!loose)
+					netif_change_flags(vlandev,
+							   flgs | IFF_UP, NULL);
+				vlan_stacked_transfer_operstate(real_dev,
+								vlandev, vlan);
+			}
+		} else if ((flgs & IFF_UP) && !loose) {
+			netif_change_flags(vlandev, flgs & ~IFF_UP, NULL);
+			vlan_stacked_transfer_operstate(real_dev, vlandev, vlan);
+		}
+	}
+
+	if ((events & VLAN_WORK_MTU) && vlandev->mtu > real_dev->mtu)
+		netif_set_mtu(vlandev, real_dev->mtu);
+
+	if (events & VLAN_WORK_FEATURES)
+		vlan_transfer_features(real_dev, vlandev);
+}
+
 static const struct net_device_ops vlan_netdev_ops = {
 	.ndo_change_mtu		= vlan_dev_change_mtu,
 	.ndo_init		= vlan_dev_init,
@@ -1027,6 +1086,7 @@ static const struct net_device_ops vlan_netdev_ops = {
 	.ndo_set_mac_address	= vlan_dev_set_mac_address,
 	.ndo_set_rx_mode	= vlan_dev_set_rx_mode,
 	.ndo_change_rx_flags	= vlan_dev_change_rx_flags,
+	.ndo_work		= vlan_dev_work,
 	.ndo_eth_ioctl		= vlan_dev_ioctl,
 	.ndo_neigh_setup	= vlan_dev_neigh_setup,
 	.ndo_get_stats64	= vlan_dev_get_stats64,
diff --git a/net/core/dev.c b/net/core/dev.c
index e1d8af0ef6ab..4b3d5cfdf6e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9822,6 +9822,7 @@ int netif_change_flags(struct net_device *dev, unsigned int flags,
 	__dev_notify_flags(dev, old_flags, changes, 0, NULL);
 	return ret;
 }
+EXPORT_SYMBOL(netif_change_flags);
 
 int __netif_set_mtu(struct net_device *dev, int new_mtu)
 {
-- 
2.54.0



* [PATCH net 2/4] net: add the driver-facing netdev_work scheduling API
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, Jakub Kicinski
In-Reply-To: <20260624182018.2445732-1-kuba@kernel.org>

With an extra event mask we can easily extend the netdev work
to also service driver-defined events. For advanced drivers
this is probably not a perfect match, but it makes running
deferred work easier in simple cases.

Expose the netdev_work facility to drivers. Add helpers
to schedule work and a dedicated ndo to perform the
driver-scheduled actions.
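
A simplified userspace model of how the driver and core masks share
one queue entry may help when reading the diff below. It mirrors the
shape of netdev_work_enqueue() and netdev_work_dequeue() but leaves
out the spinlock, the list and the refcount tracker; struct toy_dev
and the toy_* names are assumptions made for the sketch.

```c
#include <assert.h>

struct toy_dev {
	unsigned long work_pending;      /* driver-defined events */
	unsigned long work_core_pending; /* core-defined events */
	int queued;                      /* non-zero while on the work list */
};

/* Record events in either mask; queue the device if anything was set. */
void toy_enqueue(struct toy_dev *dev, unsigned long events,
		 unsigned long core)
{
	if (!events && !core)
		return;

	dev->queued = 1;
	dev->work_pending |= events;
	dev->work_core_pending |= core;
}

/*
 * Clear @mask from one of the two masks and return the bits that were
 * actually pending. The device leaves the queue only once both masks
 * are empty.
 */
unsigned long toy_dequeue(struct toy_dev *dev, unsigned long *pending,
			  unsigned long mask)
{
	unsigned long events = *pending & mask;

	*pending &= ~events;
	if (dev->queued && !dev->work_pending && !dev->work_core_pending)
		dev->queued = 0;

	return events;
}
```

Canceling all driver events while a core event is still pending keeps
the device queued, which is the reason the dequeue path checks both
masks.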

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/linux/netdevice.h | 11 ++++++
 net/core/netdev_work.c    | 81 ++++++++++++++++++++++++++++++---------
 2 files changed, 74 insertions(+), 18 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 732506787db3..9981d637f8b5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1131,6 +1131,9 @@ struct netdev_net_notifier {
  *	netdev_hw_addr_list_for_each(ha, uc). Return 0 on success or a
  *	negative errno to request a retry via the core backoff.
  *
+ * void (*ndo_work)(struct net_device *dev, unsigned long events);
+ *	Run deferred work scheduled with netdev_work_sched(@events).
+ *
  * int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
  *	This function  is called when the Media Access Control address
  *	needs to be changed. If this interface is not defined, the
@@ -1460,6 +1463,8 @@ struct net_device_ops {
 					struct net_device *dev,
 					struct netdev_hw_addr_list *uc,
 					struct netdev_hw_addr_list *mc);
+	void			(*ndo_work)(struct net_device *dev,
+					    unsigned long events);
 	int			(*ndo_set_mac_address)(struct net_device *dev,
 						       void *addr);
 	int			(*ndo_validate_addr)(struct net_device *dev);
@@ -1932,6 +1937,8 @@ enum netdev_reg_state {
  *				does not implement ndo_set_rx_mode()
  *	@work_node:		List entry for async netdev_work processing
  *	@work_tracker:		Refcount tracker for async netdev_work
+ *	@work_pending:		Driver-defined pending netdev_work, passed to
+ *				ndo_work() (see netdev_work_sched())
  *	@work_core_pending:	Core-defined pending netdev_work (NETDEV_WORK_*)
  *	@rx_mode_addr_cache:	Recycled snapshot entries for rx_mode work
  *	@rx_mode_retry_timer:	Timer that re-queues rx_mode work after failure
@@ -2329,6 +2336,7 @@ struct net_device {
 	bool			uc_promisc;
 	struct list_head	work_node;
 	netdevice_tracker	work_tracker;
+	unsigned long		work_pending;
 	unsigned long		work_core_pending;
 	struct netdev_hw_addr_list	rx_mode_addr_cache;
 	struct timer_list	rx_mode_retry_timer;
@@ -5178,6 +5186,9 @@ void dev_fetch_sw_netstats(struct rtnl_link_stats64 *s,
 			   const struct pcpu_sw_netstats __percpu *netstats);
 void dev_get_tstats64(struct net_device *dev, struct rtnl_link_stats64 *s);
 
+void netdev_work_sched(struct net_device *dev, unsigned long events);
+unsigned long netdev_work_cancel(struct net_device *dev, unsigned long mask);
+
 enum {
 	NESTED_SYNC_IMM_BIT,
 	NESTED_SYNC_TODO_BIT,
diff --git a/net/core/netdev_work.c b/net/core/netdev_work.c
index c121c24dc493..3109fae132ad 100644
--- a/net/core/netdev_work.c
+++ b/net/core/netdev_work.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 
+#include <linux/export.h>
 #include <linux/list.h>
 #include <linux/netdevice.h>
 #include <linux/rtnetlink.h>
@@ -16,32 +17,63 @@ static void netdev_work_proc(struct work_struct *work);
  *  - within the list entries (struct net_device fields):
  *	- work_node
  *	- work_tracker
+ *	- work_pending
  *	- work_core_pending
  */
 static LIST_HEAD(netdev_work_list);
 static DEFINE_SPINLOCK(netdev_work_lock);
 static DECLARE_WORK(netdev_work, netdev_work_proc);
 
-void __netdev_work_core_sched(struct net_device *dev, unsigned long event)
+static void netdev_work_enqueue(struct net_device *dev, unsigned long events,
+				unsigned long core)
 {
+	if (!events && !core)
+		return;
+
 	spin_lock_bh(&netdev_work_lock);
 	if (list_empty(&dev->work_node)) {
 		list_add_tail(&dev->work_node, &netdev_work_list);
 		netdev_hold(dev, &dev->work_tracker, GFP_ATOMIC);
 	}
-	dev->work_core_pending |= event;
+	dev->work_pending |= events;
+	dev->work_core_pending |= core;
 	spin_unlock_bh(&netdev_work_lock);
 
 	schedule_work(&netdev_work);
 }
 
+static unsigned long
+netdev_work_dequeue(struct net_device *dev, unsigned long *pending,
+		    unsigned long mask)
+{
+	unsigned long events;
+
+	spin_lock_bh(&netdev_work_lock);
+	events = *pending & mask;
+	*pending &= ~events;
+	if (!list_empty(&dev->work_node) &&
+	    !dev->work_pending && !dev->work_core_pending) {
+		list_del_init(&dev->work_node);
+		netdev_put(dev, &dev->work_tracker);
+	}
+	spin_unlock_bh(&netdev_work_lock);
+
+	return events;
+}
+
+void netdev_work_sched(struct net_device *dev, unsigned long events)
+{
+	netdev_work_enqueue(dev, events, 0);
+}
+EXPORT_SYMBOL(netdev_work_sched);
+
 /**
- * __netdev_work_core_cancel() - cancel selected core work for a netdev
+ * netdev_work_cancel() - cancel selected work for a netdev
  * @dev: net_device
  * @mask: events to cancel
  *
  * Clear @mask from the device's work pending mask. If no work is left pending
- * the device is dequeued.
+ * the device is dequeued and its ndo_work won't be called.
  *
  * No expectations on locking, but also no guarantees provided. If the caller
  * wants to touch @dev afterwards (e.g. call the work that got canceled)
@@ -50,21 +82,33 @@ void __netdev_work_core_sched(struct net_device *dev, unsigned long event)
  * Returns: the subset of @mask that was actually pending, so the caller can run
  * those events inline.
  */
+unsigned long netdev_work_cancel(struct net_device *dev, unsigned long mask)
+{
+	return netdev_work_dequeue(dev, &dev->work_pending, mask);
+}
+EXPORT_SYMBOL(netdev_work_cancel);
+
+void __netdev_work_core_sched(struct net_device *dev, unsigned long events)
+{
+	netdev_work_enqueue(dev, 0, events);
+}
+
 unsigned long
 __netdev_work_core_cancel(struct net_device *dev, unsigned long mask)
 {
-	unsigned long event;
+	return netdev_work_dequeue(dev, &dev->work_core_pending, mask);
+}
 
-	spin_lock_bh(&netdev_work_lock);
-	event = dev->work_core_pending & mask;
-	dev->work_core_pending &= ~mask;
-	if (!list_empty(&dev->work_node) && !dev->work_core_pending) {
-		list_del_init(&dev->work_node);
-		netdev_put(dev, &dev->work_tracker);
-	}
-	spin_unlock_bh(&netdev_work_lock);
+static void netdev_work_run(struct net_device *dev, unsigned long events,
+			    unsigned long core)
+{
+	if (!netif_device_present(dev))
+		return;
 
-	return event;
+	if (core & NETDEV_WORK_RX_MODE)
+		netif_rx_mode_run(dev);
+	if (events && dev->netdev_ops->ndo_work)
+		dev->netdev_ops->ndo_work(dev, events);
 }
 
 static void netdev_work_proc(struct work_struct *work)
@@ -72,9 +116,9 @@ static void netdev_work_proc(struct work_struct *work)
 	rtnl_lock();
 
 	while (true) {
+		unsigned long events = 0, core = 0;
 		netdevice_tracker tracker;
 		struct net_device *dev;
-		unsigned long core = 0;
 
 		spin_lock_bh(&netdev_work_lock);
 		if (list_empty(&netdev_work_list)) {
@@ -98,16 +142,17 @@ static void netdev_work_proc(struct work_struct *work)
 			list_del_init(&dev->work_node);
 			core = dev->work_core_pending;
 			dev->work_core_pending = 0;
+			events = dev->work_pending;
+			dev->work_pending = 0;
 			/* We took another ref above */
 			netdev_put(dev, &dev->work_tracker);
 
 			if (!dev_isalive(dev))
-				core = 0;
+				core = events = 0;
 		}
 		spin_unlock_bh(&netdev_work_lock);
 
-		if (core & NETDEV_WORK_RX_MODE)
-			netif_rx_mode_run(dev);
+		netdev_work_run(dev, events, core);
 		netdev_unlock_ops(dev);
 
 		netdev_put(dev, &tracker);
-- 
2.54.0



* [PATCH net 1/4] net: turn the rx_mode work into a generic netdev_work facility
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, Jakub Kicinski
In-Reply-To: <20260624182018.2445732-1-kuba@kernel.org>

The rx_mode update runs from a workqueue: drivers have their
ndo_set_rx_mode_async() callback executed by a single global
work item under RTNL and ops lock. This is a useful pattern.

Support multiple "events" that need to be serviced and make RX_MODE
sync the first one. Call the events "core" because later on
we will let drivers define and schedule their own.
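
The pattern can be sketched in userspace as one global list of
devices, each carrying a bitmask of pending events. This is a
single-threaded toy with no locking or refcounting; struct toy_dev,
the fixed-size list and WORK_OTHER are hypothetical and exist only
to show how events coalesce into one service pass per device.

```c
#include <assert.h>
#include <stddef.h>

#define WORK_RX_MODE (1UL << 0) /* stands in for NETDEV_WORK_RX_MODE */
#define WORK_OTHER   (1UL << 1) /* hypothetical second event */
#define MAX_DEVS     8

struct toy_dev {
	unsigned long pending;  /* events waiting to be serviced */
	int queued;             /* non-zero while on the work list */
	unsigned long last_run; /* events seen by the latest service pass */
	int runs;               /* number of service passes so far */
};

static struct toy_dev *work_list[MAX_DEVS];
static size_t work_len;

/* Record events; enqueue the device only if it is not queued yet. */
void toy_work_sched(struct toy_dev *dev, unsigned long events)
{
	if (!dev->queued) {
		if (work_len == MAX_DEVS)
			return; /* toy list is full, drop the request */
		work_list[work_len++] = dev;
		dev->queued = 1;
	}
	dev->pending |= events;
}

/* Drain the list: claim each device's events and service them once. */
void toy_work_proc(void)
{
	size_t i;

	for (i = 0; i < work_len; i++) {
		struct toy_dev *dev = work_list[i];
		unsigned long events = dev->pending;

		dev->pending = 0;
		dev->queued = 0;
		dev->last_run = events;
		dev->runs++;
	}
	work_len = 0;
}
```

Scheduling two events before the worker runs results in a single
pass that sees both bits.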

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/core/Makefile         |   2 +-
 include/linux/netdevice.h |  10 ++--
 net/core/dev.h            |  11 +++-
 net/core/dev.c            |   1 +
 net/core/dev_addr_lists.c |  77 +------------------------
 net/core/netdev_work.c    | 117 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 138 insertions(+), 80 deletions(-)
 create mode 100644 net/core/netdev_work.c

diff --git a/net/core/Makefile b/net/core/Makefile
index dc17c5a61e9a..b3fdcb4e355f 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -13,7 +13,7 @@ obj-y		     += dev.o dev_api.o dev_addr_lists.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
 			sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
 			fib_notifier.o xdp.o flow_offload.o gro.o \
-			netdev-genl.o netdev-genl-gen.o gso.o
+			netdev-genl.o netdev-genl-gen.o netdev_work.o gso.o
 
 obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b67a12541eac..732506787db3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1930,8 +1930,9 @@ enum netdev_reg_state {
  *				has been enabled due to the need to listen to
  *				additional unicast addresses in a device that
  *				does not implement ndo_set_rx_mode()
- *	@rx_mode_node:		List entry for rx_mode work processing
- *	@rx_mode_tracker:	Refcount tracker for rx_mode work
+ *	@work_node:		List entry for async netdev_work processing
+ *	@work_tracker:		Refcount tracker for async netdev_work
+ *	@work_core_pending:	Core-defined pending netdev_work (NETDEV_WORK_*)
  *	@rx_mode_addr_cache:	Recycled snapshot entries for rx_mode work
  *	@rx_mode_retry_timer:	Timer that re-queues rx_mode work after failure
  *	@rx_mode_retry_count:	Number of consecutive retries already scheduled
@@ -2326,8 +2327,9 @@ struct net_device {
 	unsigned int		promiscuity;
 	unsigned int		allmulti;
 	bool			uc_promisc;
-	struct list_head	rx_mode_node;
-	netdevice_tracker	rx_mode_tracker;
+	struct list_head	work_node;
+	netdevice_tracker	work_tracker;
+	unsigned long		work_core_pending;
 	struct netdev_hw_addr_list	rx_mode_addr_cache;
 	struct timer_list	rx_mode_retry_timer;
 	unsigned int		rx_mode_retry_count;
diff --git a/net/core/dev.h b/net/core/dev.h
index 4121c50e7c88..5d0b0305d3ba 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -167,10 +167,19 @@ int dev_change_carrier(struct net_device *dev, bool new_carrier);
 void __dev_set_rx_mode(struct net_device *dev);
 int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify);
 void netif_rx_mode_init(struct net_device *dev);
-bool netif_rx_mode_clean(struct net_device *dev);
+void netif_rx_mode_run(struct net_device *dev);
 void netif_rx_mode_sync(struct net_device *dev);
 void netif_rx_mode_cancel_retry(struct net_device *dev);
 
+/* Events for the async netdev work, tracked in netdev->work_core_pending. */
+enum netdev_work_core {
+	NETDEV_WORK_RX_MODE	= BIT(0),	/* run the rx_mode update */
+};
+
+void __netdev_work_core_sched(struct net_device *dev, unsigned long event);
+unsigned long
+__netdev_work_core_cancel(struct net_device *dev, unsigned long mask);
+
 void __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
 			unsigned int gchanges, u32 portid,
 			const struct nlmsghdr *nlh);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5c01dfaa6c44..e1d8af0ef6ab 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -12093,6 +12093,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	INIT_LIST_HEAD(&dev->ptype_all);
 	INIT_LIST_HEAD(&dev->ptype_specific);
 	INIT_LIST_HEAD(&dev->net_notifier_list);
+	INIT_LIST_HEAD(&dev->work_node);
 #ifdef CONFIG_NET_SCHED
 	hash_init(dev->qdisc_hash);
 #endif
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index e17f64a65e17..08528ca0a8b3 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -12,17 +12,10 @@
 #include <linux/export.h>
 #include <linux/list.h>
 #include <linux/spinlock.h>
-#include <linux/workqueue.h>
 #include <kunit/visibility.h>
 
 #include "dev.h"
 
-static void netdev_rx_mode_work(struct work_struct *work);
-
-static LIST_HEAD(rx_mode_list);
-static DEFINE_SPINLOCK(rx_mode_lock);
-static DECLARE_WORK(rx_mode_work, netdev_rx_mode_work);
-
 /*
  * General list handling functions
  */
@@ -1281,7 +1274,7 @@ void netif_rx_mode_cancel_retry(struct net_device *dev)
 	dev->rx_mode_retry_count = 0;
 }
 
-static void netif_rx_mode_run(struct net_device *dev)
+void netif_rx_mode_run(struct net_device *dev)
 {
 	struct netdev_hw_addr_list uc_snap, mc_snap, uc_ref, mc_ref;
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -1339,49 +1332,9 @@ static void netif_rx_mode_run(struct net_device *dev)
 	}
 }
 
-static void netdev_rx_mode_work(struct work_struct *work)
-{
-	struct net_device *dev;
-
-	rtnl_lock();
-
-	while (true) {
-		spin_lock_bh(&rx_mode_lock);
-		if (list_empty(&rx_mode_list)) {
-			spin_unlock_bh(&rx_mode_lock);
-			break;
-		}
-		dev = list_first_entry(&rx_mode_list, struct net_device,
-				       rx_mode_node);
-		list_del_init(&dev->rx_mode_node);
-		/* We must free netdev tracker under
-		 * the spinlock protection.
-		 */
-		netdev_tracker_free(dev, &dev->rx_mode_tracker);
-		spin_unlock_bh(&rx_mode_lock);
-
-		netdev_lock_ops(dev);
-		netif_rx_mode_run(dev);
-		netdev_unlock_ops(dev);
-		/* Use __dev_put() because netdev_tracker_free() was already
-		 * called above. Must be after netdev_unlock_ops() to prevent
-		 * netdev_run_todo() from freeing the device while still in use.
-		 */
-		__dev_put(dev);
-	}
-
-	rtnl_unlock();
-}
-
 static void netif_rx_mode_queue(struct net_device *dev)
 {
-	spin_lock_bh(&rx_mode_lock);
-	if (list_empty(&dev->rx_mode_node)) {
-		list_add_tail(&dev->rx_mode_node, &rx_mode_list);
-		netdev_hold(dev, &dev->rx_mode_tracker, GFP_ATOMIC);
-	}
-	spin_unlock_bh(&rx_mode_lock);
-	schedule_work(&rx_mode_work);
+	__netdev_work_core_sched(dev, NETDEV_WORK_RX_MODE);
 }
 
 static void netif_rx_mode_retry(struct timer_list *t)
@@ -1394,7 +1347,6 @@ static void netif_rx_mode_retry(struct timer_list *t)
 
 void netif_rx_mode_init(struct net_device *dev)
 {
-	INIT_LIST_HEAD(&dev->rx_mode_node);
 	__hw_addr_init(&dev->rx_mode_addr_cache);
 	timer_setup(&dev->rx_mode_retry_timer, netif_rx_mode_retry, 0);
 }
@@ -1442,24 +1394,6 @@ void dev_set_rx_mode(struct net_device *dev)
 	netif_addr_unlock_bh(dev);
 }
 
-bool netif_rx_mode_clean(struct net_device *dev)
-{
-	bool clean = false;
-
-	spin_lock_bh(&rx_mode_lock);
-	if (!list_empty(&dev->rx_mode_node)) {
-		list_del_init(&dev->rx_mode_node);
-		clean = true;
-		/* We must release netdev tracker under
-		 * the spinlock protection.
-		 */
-		netdev_tracker_free(dev, &dev->rx_mode_tracker);
-	}
-	spin_unlock_bh(&rx_mode_lock);
-
-	return clean;
-}
-
 /**
  * netif_rx_mode_sync() - sync rx mode inline
  * @dev: network device
@@ -1473,11 +1407,6 @@ bool netif_rx_mode_clean(struct net_device *dev)
  */
 void netif_rx_mode_sync(struct net_device *dev)
 {
-	if (netif_rx_mode_clean(dev)) {
+	if (__netdev_work_core_cancel(dev, NETDEV_WORK_RX_MODE))
 		netif_rx_mode_run(dev);
-		/* Use __dev_put() because netdev_tracker_free() was already
-		 * called inside netif_rx_mode_clean().
-		 */
-		__dev_put(dev);
-	}
 }
diff --git a/net/core/netdev_work.c b/net/core/netdev_work.c
new file mode 100644
index 000000000000..c121c24dc493
--- /dev/null
+++ b/net/core/netdev_work.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+#include <net/netdev_lock.h>
+
+#include "dev.h"
+
+static void netdev_work_proc(struct work_struct *work);
+
+/* @netdev_work_lock protects:
+ *  - @netdev_work_list
+ *  - within the list entries (struct net_device fields):
+ *	- work_node
+ *	- work_tracker
+ *	- work_core_pending
+ */
+static LIST_HEAD(netdev_work_list);
+static DEFINE_SPINLOCK(netdev_work_lock);
+static DECLARE_WORK(netdev_work, netdev_work_proc);
+
+void __netdev_work_core_sched(struct net_device *dev, unsigned long event)
+{
+	spin_lock_bh(&netdev_work_lock);
+	if (list_empty(&dev->work_node)) {
+		list_add_tail(&dev->work_node, &netdev_work_list);
+		netdev_hold(dev, &dev->work_tracker, GFP_ATOMIC);
+	}
+	dev->work_core_pending |= event;
+	spin_unlock_bh(&netdev_work_lock);
+
+	schedule_work(&netdev_work);
+}
+
+/**
+ * __netdev_work_core_cancel() - cancel selected core work for a netdev
+ * @dev: net_device
+ * @mask: events to cancel
+ *
+ * Clear @mask from the device's work pending mask. If no work is left pending
+ * the device is dequeued.
+ *
+ * No expectations on locking, but also no guarantees provided. If the caller
+ * wants to touch @dev afterwards (e.g. call the work that got canceled)
+ * they have to ensure @dev does not get freed.
+ *
+ * Returns: the subset of @mask that was actually pending, so the caller can run
+ * those events inline.
+ */
+unsigned long
+__netdev_work_core_cancel(struct net_device *dev, unsigned long mask)
+{
+	unsigned long event;
+
+	spin_lock_bh(&netdev_work_lock);
+	event = dev->work_core_pending & mask;
+	dev->work_core_pending &= ~mask;
+	if (!list_empty(&dev->work_node) && !dev->work_core_pending) {
+		list_del_init(&dev->work_node);
+		netdev_put(dev, &dev->work_tracker);
+	}
+	spin_unlock_bh(&netdev_work_lock);
+
+	return event;
+}
+
+static void netdev_work_proc(struct work_struct *work)
+{
+	rtnl_lock();
+
+	while (true) {
+		netdevice_tracker tracker;
+		struct net_device *dev;
+		unsigned long core = 0;
+
+		spin_lock_bh(&netdev_work_lock);
+		if (list_empty(&netdev_work_list)) {
+			spin_unlock_bh(&netdev_work_lock);
+			break;
+		}
+		dev = list_first_entry(&netdev_work_list, struct net_device,
+				       work_node);
+		/* Take a temporary reference so @dev can't be freed while we
+		 * drop the lock to grab its ops lock; the work reference is
+		 * only released once we claim the work below.
+		 * The re-locking dance is to ensure that ops lock is enough
+		 * to ensure canceling work is not racy with dequeue.
+		 */
+		netdev_hold(dev, &tracker, GFP_ATOMIC);
+		spin_unlock_bh(&netdev_work_lock);
+
+		netdev_lock_ops(dev);
+		spin_lock_bh(&netdev_work_lock);
+		if (!list_empty(&dev->work_node)) {
+			list_del_init(&dev->work_node);
+			core = dev->work_core_pending;
+			dev->work_core_pending = 0;
+			/* We took another ref above */
+			netdev_put(dev, &dev->work_tracker);
+
+			if (!dev_isalive(dev))
+				core = 0;
+		}
+		spin_unlock_bh(&netdev_work_lock);
+
+		if (core & NETDEV_WORK_RX_MODE)
+			netif_rx_mode_run(dev);
+		netdev_unlock_ops(dev);
+
+		netdev_put(dev, &tracker);
+	}
+
+	rtnl_unlock();
+}
-- 
2.54.0



* [PATCH net 0/4] net: avoid nested UP notifier events
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, Jakub Kicinski

syzbot reported that the recent ethtool rework leads to a deadlock
on stacked devices. VLANs create nested notifications, confusing
execution context. Bringing up dummy causes vlan to bring itself
up as well. Which in turn causes bond to ask for link state -
a call chain traveling in the opposite direction.

  bond    (3) bond_update_speed_duplex(vlan)
    |           ^                v
  vlan    (2) UP(vlan)    (4) vlan_ethtool_get_link_ksettings()
    |           ^                v
  dummy   (1) UP(dummy)   (5) __ethtool_get_link_ksettings()

We locked the instance lock of dummy at (1) and will
try to lock it again at (5) - which of course deadlocks.

For non-nested notifications this is avoided because NETDEV_UP
is always run ops-locked (so that bond asks for link using the
netif_ API which assumes instance lock already held). The nesting,
however, makes this problematic: we cannot carry the state of
the whole chain back in the opposite direction.

AFAICT vlan is the only driver which causes such issues.
So let's try a localized fix of deferring vlan auto-open
to a workqueue.
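
As a rough illustration of the deferral idea (not the code from this
series), here is a minimal user-space sketch. The notifier side only
records which events are pending, and a worker applies them later,
outside the critical section of the lower device. All names here
(toy_dev, toy_work_sched, toy_work_cancel, toy_work_run, WORK_LINK_UP,
WORK_MTU) are hypothetical, and the sketch leaves out locking and
reference counting entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event bits; they only mirror the idea of per-event flags. */
#define WORK_LINK_UP (1UL << 0)
#define WORK_MTU     (1UL << 1)

struct toy_dev {
	unsigned long pending; /* events recorded but not yet applied */
	bool queued;           /* true while the device sits on the work list */
	int applied_up;        /* how many times the link-up handler ran */
	int applied_mtu;       /* how many times the MTU handler ran */
};

/* Notifier side: record the event and queue the device, nothing more. */
void toy_work_sched(struct toy_dev *dev, unsigned long events)
{
	dev->pending |= events;
	dev->queued = true;
}

/* Cancel a subset of events and return the ones that were really pending.
 * The device leaves the work list once nothing is left to do.
 */
unsigned long toy_work_cancel(struct toy_dev *dev, unsigned long mask)
{
	unsigned long hit = dev->pending & mask;

	dev->pending &= ~mask;
	if (dev->queued && !dev->pending)
		dev->queued = false;
	return hit;
}

/* Worker side: claim all pending events at once, then apply them. */
void toy_work_run(struct toy_dev *dev)
{
	unsigned long events;

	if (!dev->queued)
		return;
	events = dev->pending;
	dev->pending = 0;
	dev->queued = false;

	if (events & WORK_LINK_UP)
		dev->applied_up++;
	if (events & WORK_MTU)
		dev->applied_mtu++;
}
```

Scheduling the same event twice before the worker runs applies it only
once, which is why the state change is no longer instantaneous from the
notifier's point of view.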

Jakub Kicinski (4):
  net: turn the rx_mode work into a generic netdev_work facility
  net: add the driver-facing netdev_work scheduling API
  vlan: defer real device state propagation to netdev_work
  selftests: bonding: add a test for VLAN propagation over a bonded real
    device

 Documentation/networking/netdevices.rst       |   2 +
 net/core/Makefile                             |   2 +-
 .../selftests/drivers/net/bonding/Makefile    |   1 +
 include/linux/netdevice.h                     |  21 +-
 net/8021q/vlan.h                              |  11 ++
 net/core/dev.h                                |  11 +-
 net/8021q/vlan.c                              |  76 +-------
 net/8021q/vlan_dev.c                          |  60 ++++++
 net/core/dev.c                                |   2 +
 net/core/dev_addr_lists.c                     |  77 +-------
 net/core/netdev_work.c                        | 162 ++++++++++++++++
 .../drivers/net/bonding/bond_vlan_real_dev.sh | 180 ++++++++++++++++++
 12 files changed, 457 insertions(+), 148 deletions(-)
 create mode 100644 net/core/netdev_work.c
 create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh

-- 
2.54.0



* Re: [PATCH] net: sparx5: unregister blocking notifier on init failure
From: Simon Horman @ 2026-06-24 18:16 UTC (permalink / raw)
  To: Haoxiang Li
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, Steen.Hegelund,
	daniel.machon, UNGLinuxDriver, kees, bjarni.jonasson,
	lars.povlsen, netdev, linux-arm-kernel, linux-kernel, stable
In-Reply-To: <20260623115714.2192074-1-haoxiang_li2024@163.com>

On Tue, Jun 23, 2026 at 07:57:14PM +0800, Haoxiang Li wrote:
> sparx5_register_notifier_blocks() registers the switchdev blocking
> notifier before allocating the ordered workqueue. If the workqueue
> allocation fails, the error path unregisters the switchdev and netdevice
> notifiers, but leaves the blocking notifier registered.
> 
> Add a separate error label for the workqueue allocation failure path and
> unregister the switchdev blocking notifier there.
> 
> Fixes: d6fce5141929 ("net: sparx5: add switching support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>

Reviewed-by: Simon Horman <horms@kernel.org>



* Re: [PATCH net] dt-bindings: net: renesas,ether: Drop example "ethernet-phy-ieee802.3-c22" fallback
From: Niklas Söderlund @ 2026-06-24 18:07 UTC (permalink / raw)
  To: Rob Herring (Arm)
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Krzysztof Kozlowski, Conor Dooley,
	Geert Uytterhoeven, Magnus Damm, Sergei Shtylyov, netdev,
	linux-renesas-soc, devicetree, linux-kernel
In-Reply-To: <20260624150250.131966-2-robh@kernel.org>

Hi Rob,

Thanks for your patch.

On 2026-06-24 10:02:50 -0500, Rob Herring (Arm) wrote:
> Fix the Micrel PHY in the example which shouldn't have the
> fallback "ethernet-phy-ieee802.3-c22" compatible:
> 
> Documentation/devicetree/bindings/net/renesas,ether.example.dtb: ethernet-phy@1 \
>   (ethernet-phy-id0022.1537): compatible: ['ethernet-phy-id0022.1537', 'ethernet-phy-ieee802.3-c22'] is too long
>         from schema $id: http://devicetree.org/schemas/net/micrel.yaml
> 
> Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

Acked-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>

> ---
>  Documentation/devicetree/bindings/net/renesas,ether.yaml | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/renesas,ether.yaml b/Documentation/devicetree/bindings/net/renesas,ether.yaml
> index f0a52f47f95a..dd7187f12a67 100644
> --- a/Documentation/devicetree/bindings/net/renesas,ether.yaml
> +++ b/Documentation/devicetree/bindings/net/renesas,ether.yaml
> @@ -121,8 +121,7 @@ examples:
>          #size-cells = <0>;
>  
>          phy1: ethernet-phy@1 {
> -            compatible = "ethernet-phy-id0022.1537",
> -                         "ethernet-phy-ieee802.3-c22";
> +            compatible = "ethernet-phy-id0022.1537";
>              reg = <1>;
>              interrupt-parent = <&irqc0>;
>              interrupts = <0 IRQ_TYPE_LEVEL_LOW>;
> -- 
> 2.53.0
> 

-- 
Kind Regards,
Niklas Söderlund


* Re: [PATCH v2 bpf-next 2/2] selftests/bpf: Add tests for bpf_redirect_peer with BPF_F_EGRESS
From: Daniel Borkmann @ 2026-06-24 17:54 UTC (permalink / raw)
  To: Jordan Rife, bpf
  Cc: netdev, Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	Stanislav Fomichev, Jiayuan Chen, Paul Chaignon
In-Reply-To: <20260618182035.43811-3-jordan@jrife.io>

On 6/18/26 8:20 PM, Jordan Rife wrote:
> Extend redirect tests to cover bpf_redirect_peer(BPF_F_EGRESS). SRC
> redirects to DST using bpf_redirect_peer(BPF_F_EGRESS) then traffic is
> hairpinned into DST using bpf_redirect.
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH v2 bpf-next 1/2] bpf: Support BPF_F_EGRESS with bpf_redirect_peer
From: Daniel Borkmann @ 2026-06-24 17:53 UTC (permalink / raw)
  To: Jordan Rife, bpf
  Cc: netdev, Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
	Stanislav Fomichev, Jiayuan Chen, Paul Chaignon
In-Reply-To: <20260618182035.43811-2-jordan@jrife.io>

On 6/18/26 8:20 PM, Jordan Rife wrote:
> We have several use cases where a pod injects traffic into the datapath
> of another so that the traffic appears to have originated from that
> pod. One such use case is a synthetic flow generator which injects
> synthetic traffic into a pod's datapath to enable dynamic probing and
> debugging. Another is a transparent proxy where connections originating
> from one pod are redirected towards another which proxies that
> connection. The new connection is bound to the IP of the original pod
> using IP_TRANSPARENT and its traffic is injected into that pod's
> datapath and handled as if it had originated there. This can be used for
> mTLS, etc.
> 
> We use bpf_redirect(BPF_F_INGRESS) to direct traffic leaving the proxy,
> flow generator, etc. towards the target pod, ensuring that eBPF programs
> that are meant to intercept traffic leaving that pod are executed.
> However, this doesn't work with netkit.
> 
> With netkit, an ingress redirection from proxy to workload skips eBPF
> programs that are meant to intercept traffic leaving the pod, since they
> reside on the netkit peer device. One workaround is to attach the
> same program to both the netkit peer device and the TCX ingress hook for
> the netkit pair's primary interface, but
> 
> a) This seems hacky and we need to be careful not to run the same
>     program twice for the same skb in cases where we want to pass that
>     traffic to the host stack.
> b) We're trying to keep the proxy redirection / traffic injection
>     systems as modular and separated from Cilium as possible, the system
>     that manages netkit setup and core eBPF programming.
> 
> It would be handy if instead we could redirect traffic directly from
> one netkit peer device to another. This patch proposes an extension
> to bpf_redirect_peer to allow us to do just that.
> 
> With this patch, the BPF_F_EGRESS flag tells bpf_redirect_peer to emit
> the skb in the egress direction of the target interface's peer device.
> While the main use case is netkit, I suppose you could also use this
> mode with veth as well if, e.g., there were some eBPF programs attached
> to that side of the veth pair that needed to intercept traffic.
> 
>   +---------------------------------------------------------------------+
>   | +-------------------------+         6. bpf_redirect_neigh(eth0)     |
>   | | pod (10.244.0.10)       |           ------------------------      |
>   | |                         |          |                        |     |
>   | |              +--------+ |          |      +---------+       |     |
>   | | 1. packet -->|        | |          |      |         |       |     |
>   | |    leaves ^  | netkit |<===========|======| netkit  |       |     |
>   | |           |  | peer   |=======(eBPF)=====>| primary |       |     |
>   | |           |  |        | |          |      |         |       |     |
>   | |           |  +--------+ |          |      +---------+       |     |
>   | |           |             |          | 2. bpf_redirect        v     |
>   | +-----------|-------------+          |___________________   +-------|
>   |             |                                            |  | eth0  |
>   |             | 5. bpf_redirect_peer(BPF_F_EGRESS)         |  +-------|
>   |             |________________________                    |          |
>   | +-------------------------+          |                   |          |
>   | | proxy (10.244.0.11)     |          |                   |          |
>   | | IP_TRANSPARENT          |          |                   |          |
>   | |              +--------+ |          |      +---------+  |          |
>   | | 3. packet <--|        | |          |      |         |<--          |
>   | |    enters    | netkit |<===========|======| netkit  |             |
>   | |    [proxy]   | peer   |=======(eBPF)=====>| primary |             |
>   | | 4. packet -->|        | |                 |         |             |
>   | |    leaves    +--------+ |                 +---------+             |
>   | |    sip=10.244.0.10      |                                         |
>   | +-------------------------+                                         |
>   +---------------------------------------------------------------------+
> 
> Using the proxy use case as an example, in step 5 we would redirect
> traffic leaving the proxy towards the pod's peer device using
> bpf_redirect_peer(BPF_F_EGRESS).
> 
> As a bonus, since the skb doesn't have to go through the backlog queue
> it can take full advantage of netkit's performance benefits. I set up a
> test where outgoing iperf3 traffic is injected into the datapath of
> another pod using either bpf_redirect_peer(BPF_F_EGRESS) or
> bpf_redirect(BPF_F_INGRESS). I used Cilium's eBPF host routing mode
> which skips the host stack and uses BPF redirect helpers to do all the
> routing.
> 
>    (net.ipv4.tcp_congestion_control=cubic,mtu=1500,100GiB link,Cilium
>     eBPF host routing mode)
> 
> BASELINE [bpf_redirect(BPF_F_INGRESS)]
>    1. [iperf pod] ==bpf_redirect([pod b], BPF_F_INGRESS)==> [pod b]
>    2. [pod b]     ==bpf_redirect_neigh([eth0])==>           eth0
>    3. eth0        ==over network==>                         [host b]
> 
>    [ ID] Interval           Transfer     Bitrate         Retr
>    [  5]   0.00-60.00  sec   231 GBytes  33.0 Gbits/sec  12060     sender
>    [  5]   0.00-60.00  sec   230 GBytes  33.0 Gbits/sec            receiver
> 
> TEST [bpf_redirect_peer(BPF_F_EGRESS)]
>    1. [iperf pod] ==bpf_redirect_peer([pod b], BPF_F_EGRESS)==> [pod b]
>    2. [pod b]     ==bpf_redirect_neigh([eth0])==>               eth0
>    3. eth0        ==over network==>                             [host b]
> 
>    [ ID] Interval           Transfer     Bitrate         Retr
>    [  5]   0.00-60.00  sec   272 GBytes  38.9 Gbits/sec    0       sender
>    [  5]   0.00-60.00  sec   272 GBytes  38.9 Gbits/sec            receiver
> 
> In this test, using bpf_redirect_peer(BPF_F_EGRESS) for the hop from
> [iperf pod] to [pod b] led to ~18% more throughput compared to
> bpf_redirect(BPF_F_INGRESS).
> 
> Signed-off-by: Jordan Rife <jordan@jrife.io>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andrei Vagin @ 2026-06-24 17:52 UTC (permalink / raw)
  To: Askar Safin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, torvalds, val, viro, willy
In-Reply-To: <20260624071226.2272209-1-safinaskar@gmail.com>

On Wed, Jun 24, 2026 at 12:12 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Andrei Vagin <avagin@gmail.com>:
> > The CRIU fifo test fails with this change. The problem is that vmsplice
> > with SPLICE_F_NONBLOCK to a fifo file descriptor fails with -EOPNOTSUPP.
> >
> > It seems we need a fix like this one:
> >
> > diff --git a/fs/pipe.c b/fs/pipe.c
> > index 429b0714ec57..6fc49e933727 100644
> > --- a/fs/pipe.c
> > +++ b/fs/pipe.c
> > @@ -1253,6 +1253,7 @@ static int fifo_open(struct inode *inode, struct
> > file *filp)
> >
> >         /* We can only do regular read/write on fifos */
> >         stream_open(inode, filp);
> > +       filp->f_mode |= FMODE_NOWAIT;
> >
> >         switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
> >         case FMODE_READ:
>
> Does CRIU actually rely on ability to do SPLICE_F_NONBLOCK vmsplice into
> named fifos? Or this is merely a test?

Yes, it does.

>
> If this is just a test, I think we need not preserve this behavior.
>
> I did a Debian code search with the regex "vmsplice.*SPLICE_F_NONBLOCK" and
> found very few packages. It seems all of them use pipes, not named fifos.

In short, this isn't how such cases are handled in the kernel. The fix is
simple and should be applied to avoid breaking random software.

>
> (On speed: I still think that my vmsplice patches are good thing,
> despite performance regressions in CRIU.)

I already explained that this isn't just a performance degradation, it
actually breaks the pre-dump mechanism in CRIU. vmsplice is invoked from
our parasite code within the context of a user process, where execution
speed is critical. A heavy performance penalty completely invalidates
the pre-dump logic, making the feature useless.

Under normal circumstances, patches that cause this kind of breakage
would never be merged. However, since there are exceptions to every
rule, we should let the maintainers decide how to proceed here. In CRIU,
we have a backup plan to utilize process_vm_readv to dump process
memory. We already support this mode, but it isn't the default due to
performance concerns. If these patches are merged, it will be the
only option left for CRIU to implement pre-dumping.

However, we need to look at this case in a broader context. This is yet
another example where the change introduces a workflow breakage, meaning
there might be other workloads out there that could be broken by this
change.

At a minimum, we may need to consider a deprecation plan where vmsplice
with SPLICE_F_GIFT triggers a warning for a few releases before these
changes are applied. Alternatively, we could introduce the proposed
behavior alongside a sysctl to fall back to the old behavior and explicitly
state that this fallback path will be completely deprecated in a future kernel
version.

Thanks,
Andrei


* Re: [PATCH v12 02/12] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Pawan Gupta @ 2026-06-24 17:49 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: x86, Jon Kohler, H. Peter Anvin, Josh Poimboeuf, David Kaplan,
	Sean Christopherson, Borislav Petkov, Dave Hansen, Peter Zijlstra,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, KP Singh,
	Jiri Olsa, David S. Miller, David Laight, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, David Ahern, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	Stanislav Fomichev, Hao Luo, Paolo Bonzini, Jonathan Corbet,
	Jason Baron, Alice Ryhl, Steven Rostedt, Ard Biesheuvel,
	Shuah Khan, linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf,
	netdev, linux-doc
In-Reply-To: <171efe97-fd87-45c1-9913-ff62eacab400@suse.com>

On Wed, Jun 24, 2026 at 03:12:28PM +0300, Nikolay Borisov wrote:
> 
> 
> On 23.06.26 г. 20:33 ч., Pawan Gupta wrote:
> > As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
> > the Branch History Buffer (BHB). On Alder Lake and newer parts this
> > sequence is not sufficient because it doesn't clear enough entries. This
> > was not an issue because these CPUs use the BHI_DIS_S hardware mitigation
> > in the kernel.
> > 
> > Now with VMSCAPE (BHI variant) it is also required to isolate branch
> > history between guests and userspace. Since BHI_DIS_S only protects the
> > kernel, the newer CPUs also use IBPB.
> > 
> > A cheaper alternative to the current IBPB mitigation is clear_bhb_loop().
> > But it currently does not clear enough BHB entries to be effective on newer
> > CPUs with larger BHB. At boot, dynamically set the loop count of
> > clear_bhb_loop() such that it is effective on newer CPUs too.
> > 
> > Introduce global loop counts, initializing them with appropriate value
> > based on the hardware feature X86_FEATURE_BHI_CTRL.
> > 
> > Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
> > Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> 
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> 
> Although AI brings up a valid point about whether guests should be
> pessimized and fall back to the longer sequence?

I don't disagree, but at the same time BHI mitigation for guest migration
is a different beast that should be addressed separately. A series that
adds virtual-SPEC_CTRL support is in the works. Expect the RFC to be posted
in a couple of weeks.


* Re: [PATCH 0/2] Fix a few memory bugs in RPC-with-TLS
From: Anna Schumaker @ 2026-06-24 17:41 UTC (permalink / raw)
  To: Chuck Lever, Trond Myklebust
  Cc: linux-nfs, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Michael Nemanov, Chuck Lever
In-Reply-To: <20260624165228.2920869-1-cel@kernel.org>

Hi Chuck,

On Wed, Jun 24, 2026, at 12:52 PM, Chuck Lever wrote:
> Gentle ping on this series, posted seven weeks ago. Michael
> Nemanov reviewed and tested both patches the following day; he is
> the reporter of the use-after-free that patch 2 addresses on an
> mTLS mount whose client certificate the server rejected.
>
> Could one of you queue these for an upcoming release? I am glad
> to repost against a current base if that is easier to apply.

I don't remember seeing these patches when they came in initially. Sorry
about that! I'll take a look soon, and try to include them in a bugfixes
pull request.

Anna

>
> --
> Chuck Lever


* Re: [PATCH] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump
From: Eric Dumazet @ 2026-06-24 17:22 UTC (permalink / raw)
  To: Pengfei Zhang
  Cc: dsahern, idosch, davem, kuba, pabeni, horms, netdev, linux-kernel,
	chenzhangqi, baohua, Pengfei Zhang
In-Reply-To: <20260624171156.822055-1-zhangfeionline@gmail.com>

On Wed, Jun 24, 2026 at 10:12 AM Pengfei Zhang <zhangfeionline@gmail.com> wrote:
>
> From: Pengfei Zhang <zhangpengfei16@xiaomi.com>
>
> inet6_dump_fib() saves its progress in cb->args[1] as a positional
> index within the current hash chain.  Between batches the RTNL lock
> is released, so a concurrent fib6_new_table() can insert a new table
> at the chain head, shifting all existing entries.  The saved index
> then lands on a different table, causing fib6_dump_table() to set
> w->root to the wrong table while w->node still points into the
> previous one.  fib6_walk_continue() dereferences w->node->parent
> (NULL) and panics:
>
>   BUG: kernel NULL pointer dereference, address: 0000000000000008
>   RIP: 0010:fib6_walk_continue+0x6e/0x170
>   Call Trace:
>    <TASK>
>    fib6_dump_table.isra.0+0xc5/0x240
>    inet6_dump_fib+0xf6/0x420
>    rtnl_dumpit+0x30/0xa0
>    netlink_dump+0x15b/0x460
>    netlink_recvmsg+0x1d6/0x2a0
>    ____sys_recvmsg+0x17a/0x190
>
> Fix by storing tb->tb6_id in cb->args[1] instead of a positional
> index.  On resume, skip entries until the id matches; a concurrent
> head-insert can never match the saved id, so the walker always
> resumes on the correct table.
>
> Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com>

Patch looks good, but you forgot to add a Fixes: tag

Perhaps:

Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing
tables to 2^32")

> ---
> The same crash was independently reported in a production environment
> (kernel 5.15.137, triggered by ovs-vswitchd issuing RTM_GETROUTE):
>   https://lkml.iu.edu/hypermail/linux/kernel/2402.3/02068.html
>
> The crash is probabilistic and occurs in fib6_walk_continue() at the
> FWS_U state:
>
>   case FWS_U:
>       if (fn == w->root)
>           return 0;
>       pn = rcu_dereference_protected(fn->parent, 1);
>       left = rcu_dereference_protected(pn->left, 1);  /* crash here */
>
> The crash dump shows fn->parent is NULL.  At first glance this looks
> like fn is a leaf node whose parent was freed, but closer inspection of
> the walker state reveals fn->fn_flags has RTN_ROOT set — fn is itself
> a root node of a routing table, not a child node.  A root node has no
> parent by definition, so fn->parent == NULL is correct for that node.
>
> The real question is why fn != w->root despite fn being a root.  The
> answer is that w->root and fn belong to *different* tables: w->node
> (which became fn during traversal) still references a node from the
> table that was being dumped when the batch suspended, while w->root was
> silently redirected to a different table on resume.
>
> This misdirection happens because inet6_dump_fib() uses a positional
> index to resume across batches.  Consider a hash slot containing two
> tables [A(pos=0), B(pos=1)] where B is large enough to require multiple
> batches.  On the first batch, B suspends mid-walk and the loop saves:
>
>   cb->args[1] = e;   /* e=1, position of B in the chain */
>
> The RTNL lock is then released.  At this point a concurrent
> fib6_new_table() inserts table C at the chain head via
> hlist_add_head_rcu(), making the chain [C(pos=0), A(pos=1), B(pos=2)].
>
> On the next batch, inet6_dump_fib() resumes with s_e=1 and iterates:
>
>   s_e = cb->args[1];   /* s_e = 1 */
>   hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
>       if (e < s_e)     /* skip C at pos=0 */
>           goto next;
>       /* e=1: tb now points to A, not B */
>       fib6_dump_table(tb, skb, cb);   /* called with wrong table A */
>   }
>
> Inside fib6_dump_table(), w->root is unconditionally overwritten
> before the resume branch is entered:
>
>   w->root = &table->tb6_root;        /* now A's root              */
>   /* ... */
>   } else {
>       int sernum = READ_ONCE(w->root->fn_sernum);  /* A's sernum  */
>       if (cb->args[5] != sernum) {
>           /* sernum changed: safe reset, w->node = w->root (A)    */
>           w->node = w->root;
>       } else {
>           /* sernum unchanged: w->node untouched, still in B       */
>           w->skip = 0;
>       }
>       fib6_walk_continue(w);   /* sernum equal: w->root=A, w->node=B */
>   }
>
> The sernum guard was intended to detect tree modifications and reset
> the walk, but here the two tables happen to share the same fn_sernum
> value (a global flush had previously unified them), so the guard does
> not fire and w->node is left pointing into B's tree.
>
> From this point w->root and w->node belong to different tables.  When
> fib6_walk_continue() traverses upward and reaches B's root node
> (fn->fn_flags & RTN_ROOT), the exit check:
>
>   if (fn == w->root)   /* B's root != A's root, check fails */
>       return 0;
>   pn = fn->parent;     /* B's root has no parent: pn == NULL */
>   left = pn->left;     /* NULL deref -> crash */
>
>  net/ipv6/ip6_fib.c | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index fc95738de..bda492634 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -636,11 +636,11 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
>         };
>         const struct nlmsghdr *nlh = cb->nlh;
>         struct net *net = sock_net(skb->sk);
> -       unsigned int e = 0, s_e;
>         struct hlist_head *head;
>         struct fib6_walker *w;
>         struct fib6_table *tb;
>         unsigned int h, s_h;
> +       u32 s_id;
>         int err = 0;
>
>         rcu_read_lock();
> @@ -701,23 +701,22 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
>         }
>
>         s_h = cb->args[0];
> -       s_e = cb->args[1];
> +       s_id = cb->args[1];
>
> -       for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
> -               e = 0;
> +       for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_id = 0) {
>                 head = &net->ipv6.fib_table_hash[h];
>                 hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
> -                       if (e < s_e)
> -                               goto next;
> +                       if (s_id && tb->tb6_id != s_id)
> +                               continue;
> +                       s_id = 0;
> +
> +                       cb->args[1] = tb->tb6_id;
>                         err = fib6_dump_table(tb, skb, cb);
>                         if (err != 0)
>                                 goto out;
> -next:
> -                       e++;
>                 }
>         }
>  out:
> -       cb->args[1] = e;
>         cb->args[0] = h;
>
>  unlock:
> --
> 2.34.1
>


* Re: [Bug ?] Packet with End.X segment not correctly forwarded to nexthop
From: Andrea Mayer @ 2026-06-24 17:18 UTC (permalink / raw)
  To: Anthony Doeraene
  Cc: netdev, Nicolas Dichtel, stefano.salsano, Paolo Lungaroni,
	Andrea Mayer
In-Reply-To: <3fe17ca9-25fa-4fb1-ba06-4463e0409a8b@uclouvain.be>

On Fri, 19 Jun 2026 15:25:09 +0200
Anthony Doeraene <anthony.doeraene@uclouvain.be> wrote:

> Hello,

Hi Anthony,
thanks for the description and the reproducer.

>
> I am currently experimenting with SRv6 and VRFs, and I found some weird
> interactions between the two.
>
> For the context, I need routers to have multiple VRFs, with each VRF
> having different routes to reach destinations.
> Our routers not only send packets to a specific nexthop, but also
> specify the VRF that the nexthop
> should use to forward these packets.
> To achieve this goal, routes in these VRFs push two segments: a local
> End.X segment, and a End.DT46 segment.
> Due to some implementation constraints, I want to have a single End.DT46
> segment shared by
> all routers in the network.
>
> Once packets are encapsulated by the VRF, the packet is sent in the main
> table to do a lookup for the nexthop.
> As the End.DT46 segment is shared between routers and can not be used to
> learn the nexthop, I decided to
> use an End.X segment to specify it.
>
> However, what I observe in this scenario is that End.X segment
> processing function is never called, resulting
> in the packet not being sent to the correct nexthop.

As I understand it, you are encapsulating a packet on r1 and you want to
decapsulate it on r2.

The first segment you want to apply on r1 is a local adjacency SID, bound to
the End.X behavior. When applied, it would force the encapsulated packet
onto the r1-r2 link toward your neighbour r2, and advance the segment list to
your next segment, the shared SID bound to End.DT46.

Anyway, as you have noticed, this is not what happens. The reason is that
the packet is generated locally on r1, so it is on the transmit path. An
SRv6 endpoint behavior is triggered by a SID in a packet the node receives,
not in one it originates and transmits. So the End.X behavior is never
triggered: the packet is just encapsulated and routed using the first
segment as its destination address, and not steered.

To steer the packet, you need to control the table in which the post-encap
lookup resolves the first segment. This need already came up for another
use case, originally introduced by Nicolas Dichtel. We referred to it as
'route leaking': the lookup must be done in the underlay table, so the
encapsulated traffic 'leaks' from the VRF to the underlay.

We worked on a patch for the seg6 encap that lets you choose the FIB table
for the post-encap lookup. I will post it when net-next reopens. You then
push only the shared SID and, in a table of your choice, add a route for it
via your next hop. The post-encap lookup resolves it there, and your next
hop receives that SID as the active SID and handles it:

  # next hop for the shared decap SID, in the table you pick
  ip -6 route add fc00:ffff:: via fc00::1:2 dev r1-r2 table 100

  # encap route: segment list is only the shared SID, resolved in table 100
  ip -6 route add fc00::2 encap seg6 mode encap segs fc00:ffff:: \
      lookup 100 table 10 dev X

- X is used just to determine the IPv6 SA of the encapsulated packet;
- table 10 is the encap route's table; if a VRF is bound to it, you can pass
  vrf <name> instead.

The lookup table can also be your VRF table, so the next-hop route sits next
to the encap route.

>
> I am wondering if this is an expected behavior (i.e. a node should never
> push a local segment), or if it is a real bug ?

It is expected, not a bug. A node may push a local segment. The point is
that an SRv6 endpoint behavior is not triggered on traffic the node
originates, as these behaviors run on the receive path.

>
> I am not well versed in the implementation details of SRv6 in the
> kernel, but I'm suspecting that this "bug" comes
> from the fact that seg6_output_core calls dst_output, which does not
> allow an SRv6 segment function to be called.

You understood the mechanism correctly. As explained above, seg6_output_core
ends with dst_output on the output path, so the seg6_local endpoint function
is not called there and this is by design.

>
> A minimal example is given below, which creates two namespaces (r1, r2)
> and reproduces this behavior.
> (tested on a kernel compiled on virtme-ng from commit
> e771677c937da5808f7b6c1f0e4a97ec1a84f8a8)
>
> Thank you in advance for the help and thanks for the SRv6 support on Linux,
> Doeraene Anthony
>
> File setup.sh
> ```
> # Topology under test:
> #
> #                    fc00::1:1       fc00::1:2
> # fc00::1 [ r1 ] ------------------------- [ r2 ] fc00::2
> #
> # Description:
> # ============
> #
> # Each node has an additional VRF, which it can use to provide different
> # routing decisions based on arbitrary rules (e.g. QoS aware forwarding)
> # Routes in this VRF will encapsulate the packets and push segments to
> # specify the nexthop (End.X) and the VRF the nexthop should use
> # (End.DT46). The same End.DT46 segment is shared by all nodes
> #
> # Problem:
> # ========
> #
> # Once segments are pushed, the End.X segment is never applied. As a
> # result, the segment is not popped from the SL, and the packet is sent
> # on an incorrect interface.
> #
> # Forwarding steps:
> # =================
> #
> # - R1 sends the packet to fc00::2 in its VRF `myvrf`
> # - This VRF encapsulates the packet and adds two segments:
> #   1) End.X segment to force the transmission of the packet on r1-r2
> #   2) End.DT46 segment allowing r2 to know which VRF it should use
> #      to forward the packet.
> # - After encapsulation, r1 does a lookup in its main table for the
> #   End.X segment, but does not pop the segment. The packet is thus
> #   sent incorrectly on the dummy interface
> #
> [snip]
>
> ```
>

Ciao,
Andrea

^ permalink raw reply

* [PATCH] ipv6: fib6: fix NULL deref in fib6_walk_continue() on multi-batch dump
From: Pengfei Zhang @ 2026-06-24 17:11 UTC (permalink / raw)
  To: dsahern, idosch
  Cc: davem, edumazet, kuba, pabeni, horms, netdev, linux-kernel,
	chenzhangqi, baohua, Pengfei Zhang, Pengfei Zhang

From: Pengfei Zhang <zhangpengfei16@xiaomi.com>

inet6_dump_fib() saves its progress in cb->args[1] as a positional
index within the current hash chain.  Between batches the RTNL lock
is released, so a concurrent fib6_new_table() can insert a new table
at the chain head, shifting all existing entries.  The saved index
then lands on a different table, causing fib6_dump_table() to set
w->root to the wrong table while w->node still points into the
previous one.  fib6_walk_continue() dereferences w->node->parent
(NULL) and panics:

  BUG: kernel NULL pointer dereference, address: 0000000000000008
  RIP: 0010:fib6_walk_continue+0x6e/0x170
  Call Trace:
   <TASK>
   fib6_dump_table.isra.0+0xc5/0x240
   inet6_dump_fib+0xf6/0x420
   rtnl_dumpit+0x30/0xa0
   netlink_dump+0x15b/0x460
   netlink_recvmsg+0x1d6/0x2a0
   ____sys_recvmsg+0x17a/0x190

Fix by storing tb->tb6_id in cb->args[1] instead of a positional
index.  On resume, skip entries until the id matches; a concurrent
head-insert can never match the saved id, so the walker always
resumes on the correct table.

Signed-off-by: Pengfei Zhang <zhangfeionline@gmail.com>
---
The same crash was independently reported in a production environment
(kernel 5.15.137, triggered by ovs-vswitchd issuing RTM_GETROUTE):
  https://lkml.iu.edu/hypermail/linux/kernel/2402.3/02068.html

The crash is probabilistic and occurs in fib6_walk_continue() at the
FWS_U state:

  case FWS_U:
      if (fn == w->root)
          return 0;
      pn = rcu_dereference_protected(fn->parent, 1);
      left = rcu_dereference_protected(pn->left, 1);  /* crash here */

The crash dump shows fn->parent is NULL.  At first glance this looks
like fn is a leaf node whose parent was freed, but closer inspection of
the walker state reveals fn->fn_flags has RTN_ROOT set — fn is itself
a root node of a routing table, not a child node.  A root node has no
parent by definition, so fn->parent == NULL is correct for that node.

The real question is why fn != w->root despite fn being a root.  The
answer is that w->root and fn belong to *different* tables: w->node
(which became fn during traversal) still references a node from the
table that was being dumped when the batch suspended, while w->root was
silently redirected to a different table on resume.

This misdirection happens because inet6_dump_fib() uses a positional
index to resume across batches.  Consider a hash slot containing two
tables [A(pos=0), B(pos=1)] where B is large enough to require multiple
batches.  On the first batch, B suspends mid-walk and the loop saves:

  cb->args[1] = e;   /* e=1, position of B in the chain */

The RTNL lock is then released.  At this point a concurrent
fib6_new_table() inserts table C at the chain head via
hlist_add_head_rcu(), making the chain [C(pos=0), A(pos=1), B(pos=2)].

On the next batch, inet6_dump_fib() resumes with s_e=1 and iterates:

  s_e = cb->args[1];   /* s_e = 1 */
  hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
      if (e < s_e)     /* skip C at pos=0 */
          goto next;
      /* e=1: tb now points to A, not B */
      fib6_dump_table(tb, skb, cb);   /* called with wrong table A */
  }

Inside fib6_dump_table(), w->root is unconditionally overwritten
before the resume branch is entered:

  w->root = &table->tb6_root;        /* now A's root              */
  /* ... */
  } else {
      int sernum = READ_ONCE(w->root->fn_sernum);  /* A's sernum  */
      if (cb->args[5] != sernum) {
          /* sernum changed: safe reset, w->node = w->root (A)    */
          w->node = w->root;
      } else {
          /* sernum unchanged: w->node untouched, still in B       */
          w->skip = 0;
      }
      fib6_walk_continue(w);   /* sernum equal: w->root=A, w->node=B */
  }

The sernum guard was intended to detect tree modifications and reset
the walk, but here the two tables happen to share the same fn_sernum
value (a global flush had previously unified them), so the guard does
not fire and w->node is left pointing into B's tree.

From this point w->root and w->node belong to different tables.  When
fib6_walk_continue() traverses upward and reaches B's root node
(fn->fn_flags & RTN_ROOT), the exit check:

  if (fn == w->root)   /* B's root != A's root, check fails */
      return 0;
  pn = fn->parent;     /* B's root has no parent: pn == NULL */
  left = pn->left;     /* NULL deref -> crash */
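
As a rough userspace illustration of the scenario above (not kernel code),
the sketch below models one hash chain as a singly linked list and compares
the two resume schemes. struct table, chain_add_head(), resume_by_pos() and
resume_by_id() are names made up for this sketch; id 0 is assumed to mean
"no saved id", mirroring how s_id is used in the diff below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fib6_table on one hash chain. */
struct table {
	unsigned int id;
	struct table *next;
};

/* Head insertion, modelling what hlist_add_head_rcu() does for a new table. */
static struct table *chain_add_head(struct table *head, struct table *t)
{
	t->next = head;
	return t;
}

/* Old scheme: resume at the saved position within the chain. */
static struct table *resume_by_pos(struct table *head, unsigned int s_e)
{
	unsigned int e = 0;
	struct table *t;

	for (t = head; t; t = t->next, e++)
		if (e >= s_e)
			return t;
	return NULL;
}

/* New scheme: skip entries until the saved id matches (0 = no saved id). */
static struct table *resume_by_id(struct table *head, unsigned int s_id)
{
	struct table *t;

	for (t = head; t; t = t->next)
		if (!s_id || t->id == s_id)
			return t;
	return NULL;
}
```

With a chain [A, B] and B saved as the resume point, inserting C at the head
makes resume_by_pos(head, 1) return A, while resume_by_id() with B's id still
returns B.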

 net/ipv6/ip6_fib.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index fc95738de..bda492634 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -636,11 +636,11 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	};
 	const struct nlmsghdr *nlh = cb->nlh;
 	struct net *net = sock_net(skb->sk);
-	unsigned int e = 0, s_e;
 	struct hlist_head *head;
 	struct fib6_walker *w;
 	struct fib6_table *tb;
 	unsigned int h, s_h;
+	u32 s_id;
 	int err = 0;
 
 	rcu_read_lock();
@@ -701,23 +701,22 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 	s_h = cb->args[0];
-	s_e = cb->args[1];
+	s_id = cb->args[1];
 
-	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
-		e = 0;
+	for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_id = 0) {
 		head = &net->ipv6.fib_table_hash[h];
 		hlist_for_each_entry_rcu(tb, head, tb6_hlist) {
-			if (e < s_e)
-				goto next;
+			if (s_id && tb->tb6_id != s_id)
+				continue;
+			s_id = 0;
+
+			cb->args[1] = tb->tb6_id;
 			err = fib6_dump_table(tb, skb, cb);
 			if (err != 0)
 				goto out;
-next:
-			e++;
 		}
 	}
 out:
-	cb->args[1] = e;
 	cb->args[0] = h;
 
 unlock:
-- 
2.34.1


^ permalink raw reply related

* [PATCH net] net: udp_tunnel: fix use-after-free by refcounting udp_tunnel_nic
From: Eric Dumazet @ 2026-06-24 17:10 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Ido Schimmel, David Ahern, netdev, eric.dumazet,
	Eric Dumazet, Yue Sun

Yue Sun reported a use-after-free and debugobjects warning in
udp_tunnel_nic_device_sync_work() during concurrent device operations.

The state flags of struct udp_tunnel_nic were originally bitfields
sharing a byte, modified concurrently without locking (RCU vs worker).
Even after converting to atomic bitops, a single WORK_PENDING flag
races: the workqueue core clears the pending bit before running the
worker. A concurrent queueing sets the flag, but the running worker
clears it, leading to premature freeing in unregister() while the
re-queued work is still active.

Fix this by introducing reference counting for struct
udp_tunnel_nic. Increment the refcount on successful queue_work(),
and decrement it at the end of the worker. Defer the dev_put() call
for the last device to the free path to ensure the net_device remains
valid as long as the structure is alive.

Additionally, convert concurrent modifications of the 'missed' bitmap
to atomic operations (set_bit, bitmap_zero) to prevent data races there.

Fixes: cc4e3835eff4 ("udp_tunnel: add central NIC RX port offload infrastructure")
Reported-by: Yue Sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/netdev/20260624090135.95763-1-samsun1006219@gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
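For readers following the lifetime rule described above, here is a minimal
single-threaded userspace sketch of it (not the kernel code): one reference
for the owner, one extra reference per successfully queued work item, and
the object is released by whoever drops the last reference. struct obj and
the obj_*() helpers are hypothetical; C11 atomics stand in for refcount_t,
and a plain bool stands in for the workqueue pending bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct udp_tunnel_nic. */
struct obj {
	atomic_int refcnt;	/* models refcount_t */
	bool work_queued;	/* models the workqueue pending bit */
	bool freed;		/* set instead of actually freeing memory */
};

static void obj_init(struct obj *o)
{
	atomic_init(&o->refcnt, 1);	/* the owner's reference */
	o->work_queued = false;
	o->freed = false;
}

/* Drop one reference; the last put "frees" the object. */
static void obj_put(struct obj *o)
{
	/* atomic_fetch_sub() returns the value before the subtraction. */
	if (atomic_fetch_sub(&o->refcnt, 1) == 1)
		o->freed = true;
}

/* Models queue_work(): fails if the work is already pending. */
static bool obj_queue_work(struct obj *o)
{
	if (o->work_queued)
		return false;
	o->work_queued = true;
	atomic_fetch_add(&o->refcnt, 1);	/* reference held by the work */
	return true;
}

/* Models the worker: pending is cleared first, the reference is dropped last. */
static void obj_worker(struct obj *o)
{
	o->work_queued = false;
	/* ... the real worker would sync state with the device here ... */
	obj_put(o);
}
```

In this model, an unregister path that calls obj_put() while work is queued
does not free the object; the final obj_put() at the end of obj_worker()
does.
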
 net/ipv4/udp_tunnel_nic.c | 75 +++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 30 deletions(-)

diff --git a/net/ipv4/udp_tunnel_nic.c b/net/ipv4/udp_tunnel_nic.c
index 9944ed923ddfd10f9adf6ad788c0740daeaf2adb..884b5d93b7b39f7f20855ff8ca2ec4d7ef5a9ef6 100644
--- a/net/ipv4/udp_tunnel_nic.c
+++ b/net/ipv4/udp_tunnel_nic.c
@@ -30,9 +30,8 @@ struct udp_tunnel_nic_table_entry {
  * @work:	async work for talking to hardware from process context
  * @dev:	netdev pointer
  * @lock:	protects all fields
- * @need_sync:	at least one port start changed
- * @need_replay: space was freed, we need a replay of all ports
- * @work_pending: @work is currently scheduled
+ * @flags:	sync, replay flags
+ * @refcnt:	reference count
  * @n_tables:	number of tables under @entries
  * @missed:	bitmap of tables which overflown
  * @entries:	table of tables of ports currently offloaded
@@ -44,9 +43,11 @@ struct udp_tunnel_nic {
 
 	struct mutex lock;
 
-	u8 need_sync:1;
-	u8 need_replay:1;
-	u8 work_pending:1;
+	unsigned long flags;
+#define UDP_TUNNEL_NIC_NEED_SYNC	0
+#define UDP_TUNNEL_NIC_NEED_REPLAY	1
+
+	refcount_t refcnt;
 
 	unsigned int n_tables;
 	unsigned long missed;
@@ -116,7 +117,7 @@ udp_tunnel_nic_entry_queue(struct udp_tunnel_nic *utn,
 			   unsigned int flag)
 {
 	entry->flags |= flag;
-	utn->need_sync = 1;
+	set_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags);
 }
 
 static void
@@ -283,7 +284,7 @@ udp_tunnel_nic_device_sync_by_table(struct net_device *dev,
 static void
 __udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
 {
-	if (!utn->need_sync)
+	if (!test_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags))
 		return;
 
 	if (dev->udp_tunnel_nic_info->sync_table)
@@ -291,21 +292,24 @@ __udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
 	else
 		udp_tunnel_nic_device_sync_by_port(dev, utn);
 
-	utn->need_sync = 0;
+	clear_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags);
 	/* Can't replay directly here, in case we come from the tunnel driver's
 	 * notification - trying to replay may deadlock inside tunnel driver.
 	 */
-	utn->need_replay = udp_tunnel_nic_should_replay(dev, utn);
+	if (udp_tunnel_nic_should_replay(dev, utn))
+		set_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags);
+	else
+		clear_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags);
 }
 
 static void
 udp_tunnel_nic_device_sync(struct net_device *dev, struct udp_tunnel_nic *utn)
 {
-	if (!utn->need_sync)
+	if (!test_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags))
 		return;
 
-	queue_work(udp_tunnel_nic_workqueue, &utn->work);
-	utn->work_pending = 1;
+	if (queue_work(udp_tunnel_nic_workqueue, &utn->work))
+		refcount_inc(&utn->refcnt);
 }
 
 static bool
@@ -348,7 +352,7 @@ udp_tunnel_nic_has_collision(struct net_device *dev, struct udp_tunnel_nic *utn,
 			if (!udp_tunnel_nic_entry_is_free(entry) &&
 			    entry->port == ti->port &&
 			    entry->type != ti->type) {
-				__set_bit(i, &utn->missed);
+				set_bit(i, &utn->missed);
 				return true;
 			}
 		}
@@ -483,7 +487,7 @@ udp_tunnel_nic_add_new(struct net_device *dev, struct udp_tunnel_nic *utn,
 		 * are no devices currently which have multiple tables accepting
 		 * the same tunnel type, and false positives are okay.
 		 */
-		__set_bit(i, &utn->missed);
+		set_bit(i, &utn->missed);
 	}
 
 	return false;
@@ -552,7 +556,7 @@ static void __udp_tunnel_nic_reset_ntf(struct net_device *dev)
 
 	mutex_lock(&utn->lock);
 
-	utn->need_sync = false;
+	clear_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags);
 	for (i = 0; i < utn->n_tables; i++)
 		for (j = 0; j < info->tables[i].n_entries; j++) {
 			struct udp_tunnel_nic_table_entry *entry;
@@ -696,8 +700,8 @@ udp_tunnel_nic_flush(struct net_device *dev, struct udp_tunnel_nic *utn)
 	for (i = 0; i < utn->n_tables; i++)
 		memset(utn->entries[i], 0, array_size(info->tables[i].n_entries,
 						      sizeof(**utn->entries)));
-	WARN_ON(utn->need_sync);
-	utn->need_replay = 0;
+	WARN_ON(test_bit(UDP_TUNNEL_NIC_NEED_SYNC, &utn->flags));
+	clear_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags);
 }
 
 static void
@@ -713,8 +717,8 @@ udp_tunnel_nic_replay(struct net_device *dev, struct udp_tunnel_nic *utn)
 	for (i = 0; i < utn->n_tables; i++)
 		for (j = 0; j < info->tables[i].n_entries; j++)
 			udp_tunnel_nic_entry_freeze_used(&utn->entries[i][j]);
-	utn->missed = 0;
-	utn->need_replay = 0;
+	bitmap_zero(&utn->missed, UDP_TUNNEL_NIC_MAX_TABLES);
+	clear_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags);
 
 	if (!info->shared) {
 		udp_tunnel_get_rx_info(dev);
@@ -728,6 +732,8 @@ udp_tunnel_nic_replay(struct net_device *dev, struct udp_tunnel_nic *utn)
 			udp_tunnel_nic_entry_unfreeze(&utn->entries[i][j]);
 }
 
+static void udp_tunnel_nic_put(struct udp_tunnel_nic *utn);
+
 static void udp_tunnel_nic_device_sync_work(struct work_struct *work)
 {
 	struct udp_tunnel_nic *utn =
@@ -736,14 +742,15 @@ static void udp_tunnel_nic_device_sync_work(struct work_struct *work)
 	rtnl_lock();
 	mutex_lock(&utn->lock);
 
-	utn->work_pending = 0;
 	__udp_tunnel_nic_device_sync(utn->dev, utn);
 
-	if (utn->need_replay)
+	if (test_bit(UDP_TUNNEL_NIC_NEED_REPLAY, &utn->flags))
 		udp_tunnel_nic_replay(utn->dev, utn);
 
 	mutex_unlock(&utn->lock);
 	rtnl_unlock();
+
+	udp_tunnel_nic_put(utn);
 }
 
 static struct udp_tunnel_nic *
@@ -759,6 +766,7 @@ udp_tunnel_nic_alloc(const struct udp_tunnel_nic_info *info,
 	utn->n_tables = n_tables;
 	INIT_WORK(&utn->work, udp_tunnel_nic_device_sync_work);
 	mutex_init(&utn->lock);
+	refcount_set(&utn->refcnt, 1);
 
 	for (i = 0; i < n_tables; i++) {
 		utn->entries[i] = kzalloc_objs(*utn->entries[i],
@@ -782,9 +790,19 @@ static void udp_tunnel_nic_free(struct udp_tunnel_nic *utn)
 
 	for (i = 0; i < utn->n_tables; i++)
 		kfree(utn->entries[i]);
+
+	if (utn->dev)
+		dev_put(utn->dev);
+
 	kfree(utn);
 }
 
+static void udp_tunnel_nic_put(struct udp_tunnel_nic *utn)
+{
+	if (refcount_dec_and_test(&utn->refcnt))
+		udp_tunnel_nic_free(utn);
+}
+
 static int udp_tunnel_nic_register(struct net_device *dev)
 {
 	const struct udp_tunnel_nic_info *info = dev->udp_tunnel_nic_info;
@@ -863,6 +881,7 @@ static void
 udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
 {
 	const struct udp_tunnel_nic_info *info = dev->udp_tunnel_nic_info;
+	bool last = true;
 
 	udp_tunnel_nic_lock(dev);
 
@@ -889,6 +908,7 @@ udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
 			udp_tunnel_drop_rx_info(dev);
 			utn->dev = first->dev;
 			udp_tunnel_nic_unlock(dev);
+			last = false;
 			goto release_dev;
 		}
 
@@ -901,16 +921,11 @@ udp_tunnel_nic_unregister(struct net_device *dev, struct udp_tunnel_nic *utn)
 	udp_tunnel_nic_flush(dev, utn);
 	udp_tunnel_nic_unlock(dev);
 
-	/* Wait for the work to be done using the state, netdev core will
-	 * retry unregister until we give up our reference on this device.
-	 */
-	if (utn->work_pending)
-		return;
-
-	udp_tunnel_nic_free(utn);
+	udp_tunnel_nic_put(utn);
 release_dev:
 	dev->udp_tunnel_nic = NULL;
-	dev_put(dev);
+	if (!last)
+		dev_put(dev);
 }
 
 static int
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* Re: [PATCH] octeontx2-af: Free BPID bitmap on setup failure
From: Simon Horman @ 2026-06-24 17:09 UTC (permalink / raw)
  To: Haoxiang Li
  Cc: sgoutham, lcherian, gakula, hkelam, sbhatta, andrew+netdev, davem,
	edumazet, kuba, pabeni, netdev, linux-kernel, stable
In-Reply-To: <20260623114316.2182271-1-haoxiang_li2024@163.com>

On Tue, Jun 23, 2026 at 07:43:16PM +0800, Haoxiang Li wrote:
> nix_setup_bpids() allocates bp->bpids with rvu_alloc_bitmap(), which uses
> a plain kcalloc(). If any of the following devm_kcalloc() allocations for
> the BPID mapping arrays fails, the function returns without freeing the
> bitmap. Free the BPID bitmap before returning from those error paths.
> 
> Fixes: d6212d2e41a0 ("octeontx2-af: Create BPIDs free pool")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>

Reviewed-by: Simon Horman <horms@kernel.org>

I am wondering if you did a pass for any other similar problems
with users of rvu_alloc_bitmap.

^ permalink raw reply

* Re: [PATCH v3] nfc: nci: add data_len bound checks to activation parameter extractors
From: David Heidelberg @ 2026-06-24 16:53 UTC (permalink / raw)
  To: hexlabsecurity; +Cc: linux-kernel, Simon Horman, netdev, oe-linux-nfc
In-Reply-To: <20260612-b4-disp-6d52d8b0-v3-1-e26221f8826d@proton.me>


On Fri, 12 Jun 2026 12:50:25 -0500, Bryam Vargas wrote:
 > nfc: nci: add data_len bound checks to activation parameter extractors

Applied, thanks!

[1/1] nfc: nci: add data_len bound checks to activation parameter extractors
       commit: 6f9301aba1ec2f926725c3929a41dd9814231a50

Best regards,
-- 
David Heidelberg <david@ixit.cz>

^ permalink raw reply

* Re: [PATCH 0/2] Fix a few memory bugs in RPC-with-TLS
From: Chuck Lever @ 2026-06-24 16:52 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker
  Cc: linux-nfs, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Michael Nemanov, Chuck Lever
In-Reply-To: <20260504-sunrpc-tls-clnt-pin-v1-0-197f359c6072@oracle.com>

Gentle ping on this series, posted seven weeks ago. Michael
Nemanov reviewed and tested both patches the following day; he is
the reporter of the use-after-free that patch 2 addresses on an
mTLS mount whose client certificate the server rejected.

Could one of you queue these for an upcoming release? I am glad
to repost against a current base if that is easier to apply.

--
Chuck Lever

^ permalink raw reply

* Re: [BUG] KFENCE: use-after-free read in udp_tunnel_nic_device_sync_work
From: Eric Dumazet @ 2026-06-24 16:51 UTC (permalink / raw)
  To: Sam Sun
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, netdev,
	linux-kernel, syzkaller
In-Reply-To: <CAEkJfYMZvy1XWuA_y13NfwXef+wUh4jaiQpkzHfr4=HDbr2HAQ@mail.gmail.com>

On Wed, Jun 24, 2026 at 9:36 AM Sam Sun <samsun1006219@gmail.com> wrote:
>

> I tested the refcount version, and I could no longer reproduce the bug with
> the C reproducer. I think this patch fixes the specific lifetime problem
> exposed by the reproducer.
>

Nice, thanks for testing it. I will cook an official patch then.

^ permalink raw reply

* Re: [PATCH net-next v5 1/4] dpll: add DPLL_PIN_TYPE_INT_NCO pin type
From: Ivan Vecera @ 2026-06-24 16:42 UTC (permalink / raw)
  To: Vadim Fedorenko, Kubalewski, Arkadiusz, Jiri Pirko,
	Jakub Kicinski
  Cc: netdev@vger.kernel.org, Jiri Pirko, David S. Miller,
	Donald Hunter, Eric Dumazet, Schmidt, Michal, Paolo Abeni,
	Vaananen, Pasi, Oros, Petr, Prathosh Satish, Simon Horman,
	linux-kernel@vger.kernel.org
In-Reply-To: <0f8fe4e0-72d8-48a6-96ad-d1650919d2df@linux.dev>

On 6/24/26 5:57 PM, Vadim Fedorenko wrote:
> On 19/06/2026 18:07, Ivan Vecera wrote:
>> On 6/17/26 1:59 PM, Kubalewski, Arkadiusz wrote:
>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>> Sent: Monday, June 15, 2026 2:00 PM
>>>>
>>>> On 6/11/26 2:09 PM, Jiri Pirko wrote:
>>>>> Wed, Jun 10, 2026 at 05:45:46PM +0200, ivecera@redhat.com wrote:
>>>>>> On 6/10/26 3:04 PM, Kubalewski, Arkadiusz wrote:
>>>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>>>> Sent: Tuesday, June 9, 2026 4:59 PM
>>>>>>>>
>>>>>>>> On 6/9/26 4:00 PM, Kubalewski, Arkadiusz wrote:
>>>>>>>>>> From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>>>> Sent: Tuesday, June 9, 2026 10:51 AM
>>>>>>>>>>
>>>>>>>>>> Mon, Jun 08, 2026 at 07:03:46PM +0200,
>>>>>>>>>> arkadiusz.kubalewski@intel.com
>>>>>>>>>> wrote:
>>>>>>>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>>>>>>>> Sent: Monday, June 8, 2026 5:48 PM
>>>>>>>>>>>>
>>>>>>>>>>>> On 6/8/26 4:43 PM, Kubalewski, Arkadiusz wrote:
>>>>>>>>>>>>>> From: Ivan Vecera <ivecera@redhat.com>
>>>>>>>>>>>>>> Sent: Sunday, May 31, 2026 9:44 PM ...
>>>>>>>>>>>>>>            -
>>>>>>>>>>>>>>              name: gnss
>>>>>>>>>>>>>>              doc: GNSS recovered clock
>>>>>>>>>>>>>> +      -
>>>>>>>>>>>>>> +        name: int-nco
>>>>>>>>>>>>>> +        doc: |
>>>>>>>>>>>>>> +          Device internal numerically controlled oscillator.
>>>>>>>>>>>>>> +          When connected as a DPLL input, the DPLL enters NCO mode
>>>>>>>>>>>>>> +          where the output frequency is adjusted by the host via
>>>>>>>>>>>>>> +          the PTP clock interface.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ivan!
>>>>>>>>>>>>>
>>>>>>>>>>>>> How would you control this in case of automatic mode dpll?
>>>>>>>>>>>>> Automatic mode DPLL shall be controlled on HW level, such pin
>>>>>>>>>>>>> breaks that rule and requires some driver magic to show it is
>>>>>>>>>>>>> higher priority than the rest of the pins?
>>>>>>>>>>>>
>>>>>>>>>>>> The NCO pin can be connected only in manual mode. In other 
>>>>>>>>>>>> words
>>>>>>>>>>>> a
>>>>>>>>>>>> DPLL in automatic mode cannot select NCO pin (switch to NCO 
>>>>>>>>>>>> mode)
>>>>>>>>>>>> by
>>>>>>>>>>>> its own.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Being picky on DPLL_MODE for enabling feature is not 
>>>>>>>>>>> something we
>>>>>>>>>>> can allow if it is not related to HW limitation, is it?
>>>>>>>>>>> Could you please elaborate why it is not possible for AUTOMATIC
>>>>>>>>>>> mode?
>>>>>>>>>>
>>>>>>>>>> In automatic mode, the pin selection logic is defined upon 
>>>>>>>>>> prio. I
>>>>>>>>>> can imagine that if NCO pin has the highest prio of the available
>>>>>>>>>> ones, it gets picked. I would be aligned 100% with automatic mode
>>>>>>>>>> behaviour.
>>>>>>>>>> Is there a real usecase for it?
>>>>>>>>>>
>>>>>>>>>> [..]
>>>>>>>>>
>>>>>>>>> This is not true. AUTOMATIC mode is HW solution, SW driver ONLY
>>>>>>>>> configures priorities on the inputs, not manages the active inputs.
>>>>>>>>> This breaks that behavior, the SW driver would have to manually
>>>>>>>>> override the AUTOMATIC mode to be fed from such NCO pin as it doesn't
>>>>>>>>> exist on its priority list, HW cannot pick or use it.
>>>>>>>>
>>>>>>>> Correct, AUTO mode is hardware feature and it should not be 
>>>>>>>> emulated
>>>>>>>> by a
>>>>>>>> driver. If the hardware does not support it then the switching
>>>>>>>> between
>>>>>>>> input references should be done by userspace (by monitoring ffo,
>>>>>>>> phase_offset, operstate).
>>>>>>>>
>>>>>>>
>>>>>>> Yes, exactly, so for AUTOMATIC mode HW it will not be possible to
>>>>>>> create
>>>>>>> such pin, which means that NCO pin would serve only a MANUAL mode
>>>>>>> implementation.
>>>>>>> Basically this is something we shall not allow to happen. DPLL API
>>>>>>> should be designed to cover the case where AUTO mode is able to
>>>>>>> implement
>>>>>>> all features consistently.
>>>>>>
>>>>>> If you don't like the proposal from Jiri (NCO switch driven by NCO 
>>>>>> pin
>>>>>> priority -> highest==enter_nco else leave_nco) then it could be
>>>>>> possible
>>>>>> to handle the switching by allowing the state 'connected' in AUTO 
>>>>>> mode
>>>>>> for the NCO pin type. Then the implementation will be the same for 
>>>>>> both
>>>>>> selection modes.
>>>>>>
>>>>>> Only difference would be that a user does not need to switch the 
>>>>>> device
>>>>>> from the AUTO to MANUAL mode.
>>>>>>
>>>>>>>>> The real use case is that any DPLL can switch the mode to this one
>>>>>>>>> instead of implementing MANUAL mode just to use the feature with a
>>>>>>>>> 'virtual' pin.
>>>>>>>>
>>>>>>>> I don't expect this... but it is up to a driver. I don't plan such
>>>>>>>> functionality in zl3073x as the NCO pin does not expose prio_get()
>>>>>>>> and
>>>>>>>> prio_set() callbacks - so it is clear that this pin cannot be 
>>>>>>>> part of
>>>>>>>> the
>>>>>>>> automatic selection.
>>>>>>>>
>>>>>>>> Ivan
>>>>>>>
>>>>>>> There is a difference between particular HW and API capabilities, 
>>>>>>> with
>>>>>>> the
>>>>>>> proposed API we would disallow the possibility of such 
>>>>>>> implementation
>>>>>>> for
>>>>>>> existing HW variants.
>>>>>>>
>>>>>>> DPLL NCO MODE would allow that but as pointed here by Ivan and by 
>>>>>>> Jiri
>>>>>>> in
>>>>>>> the other email it would also require the extra implementation for
>>>>>>> some
>>>>>>> configuration - device level phase/ffo handling.
>>>>>>>
>>>>>>> To summarize it all, I don't have such simple solution for it.
>>>>>>>
>>>>>>> First thing that comes to my mind is to combine both approaches.
>>>>>>> Make it possible for AUTOMATIC mode to also set "CONNECTED" state
>>>>>>> on certain kind of "OVERRIDE" pins, where it could be determined by
>>>>>>> the type of PIN and embed that logic into the DPLL subsystem.
>>>>>>
>>>>>> The possible states for particular pins are now handled at a driver
>>>>>> level so the driver decides if the requested state is correct or not.
>>>>>> So it could be easy to implement this.
>>>>>>
>>>>>> For auto mode allowed states:
>>>>>> - input references: selectable / disconnected
>>>>>> - nco pin: connected / disconnected
>>>>>>
>>>>>>> Basically, if driver registers such NCO pin it would be always
>>>>>>> selected
>>>>>>> manually, and in such case all the other pins are going to
>>>>>>> disconnected
>>>>>>> state while DPLL mode is also a "OVERRIDE" or something like it.
>>>>>>
>>>>>> I would leave this decision on the driver level... Imagine the
>>>>>> potential
>>>>>> HW that would allow to switch NCO mode if there is no valid input
>>>>>> reference.
>>>>>>
>>>>>> Example:
>>>>>>
>>>>>> REF0 (prio 0) -> +------+ -> OUT0
>>>>>> REF1 (prio 1) -> | DPLL | -> ...
>>>>>> NCO  (prio 2) -> +------+ -> OUTn
>>>>>>
>>>>>> Such HW would prefer REF0 or REF1 and lock to one of them if they are
>>>>>> qualified. But if they are NOT, then it switches to NCO mode.
>>>
>>> Now you said yourself "NCO mode" ... I agree that it would be a mode in
>>> that case. Where instead of running on regular/built in XO dpll would 
>>> run
>>> on NCO and user could select it, and this would be addition to regular
>>> behavior.
>>>
>>> I also agree that the pin approach might be better/easier to use, 
>>> assuming
>>> frequency offset for all the outputs given dpll drives, it makes more 
>>> sense
>>> to have it configurable on input side.
>>
>> +1
>>
>>>>>>
>>>>>> In this situation the relevant driver would allow to configure 
>>>>>> priority
>>>>>> and state 'selectable' for this NCO pin.
>>>>>>
>>>>>>> Perhaps the pin type could include OVERRIDE in its name to make it
>>>>>>> less confusing and needs some extra documentation.
>>>>>>>
>>>>>>> Thoughts?
>>>>>> I think _INT_ is ok. In the case of TYPE_INT_OSCILLATOR it is also
>>>>>> obvious that it is not a standard input reference.
>>>>>>
>>>>>> Jiri, Vadim, Arek, thoughts?
>>>>>
>>>>> I agree with you, the driver should have the flexibility to implement
>>>>> this according to his/hw's needs/capabilities. If it implements prio
>>>>> selection in AUTO mode, let it have it. If it implements manual NCO 
>>>>> pin
>>>>> selection in AUTO mode using connected/disconnected override, let it
>>>>> have it.
>>>
>>> I don't know 'current' HW that is capable of using AUTO mode as a 
>>> part of
>>> HW-based priority source selection and use such NCO input..
>>> But as already explained above, this is special mode of regular XO, 
>>> which
>>> allows DPLL's output frequency offset configuration.
>>
>> Lets keep this available for potential future HW. I can imagine a
>> situation where a user will prefer an automatic switch to NCO mode
>> if there is no qualified input reference - automatic switch means
>> that HW will support this (not emulated by the driver).
>>
>>>>>
>>>>> Moreover, I actually like the "override" capability for pins in AUTO
>>>>> mode in general. It may be handy for other usecases as well.
>>>>>
>>>> Arek? Vadim?
>>>>
>>>> Thanks,
>>>> Ivan
>>>
>>> Agree, 'override' capability of a pin would be the way to go for this 
>>> and
>>> other similar further cases.
>>>
>>> I believe a single approach on this would be best, I mean if AUTO mode
>>> needs a capability, to switch from regular behavior to 'OVERRIDE', and
>>> 'OVERRIDE' is only pin capability that allows such behavior for AUTO
>>> mode, then similar approach should be used on MANUAL mode, to make
>>> userspace know that such pin is always available to set "CONNECTED"
>>> and make the userspace implementation consistent on enabling it no 
>>> matter
>>> if AUTO or MANUAL mode dpll.
>>
>> Proposal:
>> 1) new pin capability
>>     - name: state-connected-override
>>     - doc: pin state can be changed to connected in any DPLL mode
>>
>> 2) new NCO pin type to switch the DPLL to NCO mode when connected
>>
>> 3) automatic-only DPLL
>>     - should expose NCO pin with state-connected-override capability
>>
>> 4) manual-only DPLL
>>    - does not need to expose NCO pin with state-connected-override cap
>>
>> 5) dual-mode DPLL (supporting mode switching)
>>    - if it exposes NCO pin with the override cap then it has to support
>>      switching to NCO mode directly from AUTO mode
>>    - if does not expose NCO pin with the override cap then a user MUST
>>      switch the DPLL mode from AUTO to MANUAL to be able to make NCO
>>      pin connected to the DPLL
> 
> I still don't see good reasoning for the pin. Even this sentence says
> "DPLL mode" which keeps me thinking whether we have to move it to a
> special DPLL mode. All these items look like overcomplication of a
> simple function of the device itself. DPLL can be either in the closed
> loop when one of the pins provides a signal to align to, or in the open
> loop meaning that software can control adjustments to phase/frequency.
> But it's definitely a property of the device, and it's not a pin in any
> kind...

Yes, at first glance, introducing a new DPLL mode would be easier and
simpler. But it depends on how you look at it.

What is NCO mode and how does it differ from other modes?

In other modes, the DPLL loop is controlled by the input signal that the
DPLL is locked to. Changes in frequency (FFO) and phase (phase offset)
affect the output signal(s) that are produced from this DPLL.

In NCO mode, the frequency and, consequently, phase changes are
determined by software (e.g. ptp4l userspace via the PTP API)... these
changes from userspace can be understood as an input signal that affects
the DPLL outputs.

In addition, we could use the existing DPLL pin attributes to control
the DPLL device in NCO mode.

I mean specifically:
* fractional-frequency-offset-ppt as a variant of .adjfine(), but with
   a higher resolution (adjfine has a resolution of about 15.26 ppt,
   i.e. 2^-16 ppm)
* phase-adjust as a higher-resolution variant of .adjphase() (adjphase
   uses nanoseconds, whereas phase-adjust uses picoseconds)

So:
'dpll pin set id <nco-pin> fractional-frequency-offset-ppt X'
  corresponds to PTP's adjfine(X-as-scaled-ppm)
'dpll pin set id <nco-pin> phase-adjust Y' corresponds to PTP's
  adjphase(Y/1000)
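
To make the unit mapping concrete, here is a small sketch of the two
conversions, assuming scaled ppm means ppm with a 16-bit binary fractional
part and that both conversions truncate toward zero. The helper names
ffo_ppt_to_scaled_ppm() and phase_ps_to_ns() are made up for illustration
and are not part of the DPLL or PTP API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ppt -> scaled ppm: 1 ppm = 10^6 ppt, and scaled ppm carries 16
 * fractional bits, so scaled_ppm = ppt * 2^16 / 10^6 (truncated).
 */
static int64_t ffo_ppt_to_scaled_ppm(int64_t ppt)
{
	return ppt * 65536 / 1000000;
}

/* ps -> ns: the sub-nanosecond part is dropped. */
static int32_t phase_ps_to_ns(int64_t ps)
{
	return (int32_t)(ps / 1000);
}
```

For example, 1000000 ppt (1 ppm) maps to 65536 scaled ppm, while anything
below 16 ppt truncates to 0, which is the resolution loss mentioned above.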

So in short, the NCO pin represents a software "signal" from userspace.

If we were to go the special DPLL mode route and want to implement this
functionality in the DPLL subsystem in the future, we would have to add
extra attributes (ffo, phase) to dpll_device.

WDYT?

Thanks,
Ivan


^ permalink raw reply

* Re: [PATCH net v7 3/4] iavf: send MAC change request synchronously
From: Simon Horman @ 2026-06-24 16:31 UTC (permalink / raw)
  To: jtornosm
  Cc: Simon Horman, netdev, intel-wired-lan, przemyslaw.kitszel,
	aleksandr.loktionov, jacob.e.keller, anthony.l.nguyen, davem,
	edumazet, kuba, pabeni
In-Reply-To: <20260623101800.991293-4-jtornosm@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
iavf: send MAC change request synchronously

This commit fixes a deadlock in iavf_set_mac() by making the MAC
address change synchronous. It introduces iavf_poll_virtchnl_response()
to directly poll the admin queue instead of waiting on the watchdog task.

> diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
> index 630388e9d28cb..3fa288e3798ae 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
[ ... ]
> @@ -1049,25 +1103,23 @@ static int iavf_set_mac(struct net_device *netdev, void *p)
>  		return -EADDRNOTAVAIL;
>  
>  	ret = iavf_replace_primary_mac(adapter, addr->sa_data);
> -
>  	if (ret)
>  		return ret;
>  
> -	ret = wait_event_interruptible_timeout(adapter->vc_waitqueue,
> -					       iavf_is_mac_set_handled(netdev, addr->sa_data),
> -					       msecs_to_jiffies(2500));
> -
> -	/* If ret < 0 then it means wait was interrupted.
> -	 * If ret == 0 then it means we got a timeout.
> -	 * else it means we got response for set MAC from PF,
> -	 * check if netdev MAC was updated to requested MAC,
> -	 * if yes then set MAC succeeded otherwise it failed return -EACCES
> -	 */
> -	if (ret < 0)
> +	ret = iavf_set_mac_sync(adapter, addr->sa_data);
> +	if (ret) {
> +		/* Rollback only if send failed (message never reached PF).
> +		 * Don't rollback on timeout (-EAGAIN) because the message was
> +		 * sent and PF will eventually respond. When the response arrives,
> +		 * iavf_virtchnl_completion() will handle rollback (on PF error)
> +		 * or acceptance (on PF success) automatically.
> +		 */
> +		if (ret != -EAGAIN) {
> +			iavf_mac_add_reject(adapter);
> +			ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
> +		}
>  		return ret;
> -
> -	if (!ret)
> -		return -EAGAIN;
>  
>  	if (!ether_addr_equal(netdev->dev_addr, addr->sa_data))
>  		return -EACCES;

[Severity: High]
Will this cause a persistent state mismatch if a timeout is followed by a
retry?

If iavf_set_mac_sync() times out, it returns -EAGAIN and bypasses this
rollback, leaving adapter->hw.mac.addr updated but netdev->dev_addr at the
old MAC.

If the user then retries before the PF asynchronously completes the original
request, iavf_add_ether_addrs() returns -EBUSY. Since -EBUSY != -EAGAIN,
the rollback runs, forcibly reverting adapter->hw.mac.addr to the old MAC.

When the PF's delayed success response finally arrives,
iavf_virtchnl_completion() sees the old MAC in both hw.mac.addr and
netdev->dev_addr, bypassing the netdev MAC update. The PF hardware is now
programmed with the new MAC, but the kernel retains the old MAC, dropping
traffic.
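
The sequence described above can be replayed with a small userspace model. This is only a sketch of the control flow as quoted in the hunk, not driver code: struct vf_model, model_set_mac() and model_pf_success() are hypothetical stand-ins, and the completion handler is assumed to copy hw.mac.addr into dev_addr only when the two differ.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

struct vf_model {
	uint8_t hw_mac[MAC_LEN];   /* stands in for adapter->hw.mac.addr */
	uint8_t dev_addr[MAC_LEN]; /* stands in for netdev->dev_addr */
};

/* Assumed completion behaviour on PF success: sync dev_addr if it differs. */
static void model_pf_success(struct vf_model *vf)
{
	if (memcmp(vf->dev_addr, vf->hw_mac, MAC_LEN) != 0)
		memcpy(vf->dev_addr, vf->hw_mac, MAC_LEN);
}

/*
 * Models the tail of iavf_set_mac() from the quoted hunk.
 * send_ret is the (simulated) result of the synchronous send.
 */
static int model_set_mac(struct vf_model *vf, const uint8_t *mac, int send_ret)
{
	/* iavf_replace_primary_mac() updates the hw address up front */
	memcpy(vf->hw_mac, mac, MAC_LEN);

	if (send_ret) {
		/* roll back on everything except a timeout */
		if (send_ret != -EAGAIN)
			memcpy(vf->hw_mac, vf->dev_addr, MAC_LEN);
		return send_ret;
	}

	/* success: the PF answered in time */
	model_pf_success(vf);
	return memcmp(vf->dev_addr, mac, MAC_LEN) ? -EACCES : 0;
}
```

Calling model_set_mac() with -EAGAIN, then again with -EBUSY, and finally model_pf_success() leaves both addresses at the old MAC in this model, even though the PF accepted the new one.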

> diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
> index ec234cc8bd9db..e6b7e8f82c7c1 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
[ ... ]
> @@ -555,20 +556,23 @@ iavf_set_mac_addr_type(struct virtchnl_ether_addr *virtchnl_ether_addr,
>   * @adapter: adapter structure
>   *
>   * Request that the PF add one or more addresses to our filters.
> - **/
> -void iavf_add_ether_addrs(struct iavf_adapter *adapter)
> + *
> + * Return: 0 on success, negative on failure
> + */
> +int iavf_add_ether_addrs(struct iavf_adapter *adapter)
>  {
>  	struct virtchnl_ether_addr_list *veal;
>  	struct iavf_mac_filter *f;
>  	int i = 0, count = 0;
>  	bool more = false;
>  	size_t len;
> +	int ret;
>  
>  	if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) {
>  		/* bail because we already have a command pending */
>  		dev_err(&adapter->pdev->dev, "Cannot add filters, command %d pending\n",
>  			adapter->current_op);
> -		return;
> +		return -EBUSY;
>  	}

[Severity: High]
Could routine background tasks cause MAC changes to frequently fail?

The watchdog task routinely sets adapter->current_op for background commands
like VIRTCHNL_OP_GET_STATS. If iavf_set_mac() is called during this window,
iavf_set_mac_sync() will hit this check and immediately fail the MAC change
with -EBUSY.

Because there is no retry logic in iavf_set_mac(), this error propagates
directly to userspace, making MAC address configuration randomly fail based
on watchdog timing.

[ ... ]
> @@ -712,8 +724,8 @@ static void iavf_mac_add_ok(struct iavf_adapter *adapter)
>   * @adapter: adapter structure
>   *
>   * Remove filters from list based on PF response.
> - **/
> -static void iavf_mac_add_reject(struct iavf_adapter *adapter)
> + */
> +void iavf_mac_add_reject(struct iavf_adapter *adapter)
>  {
>  	struct net_device *netdev = adapter->netdev;
>  	struct iavf_mac_filter *f, *ftmp;

[Severity: Medium]
Does exporting and reusing iavf_mac_add_reject() for local rollback cause
unintended background MAC changes?

When the user requests a MAC address that is already in the filter list
(e.g., a secondary MAC), iavf_add_filter() sets f->is_new_mac = false and
f->add = true, and flags IAVF_FLAG_AQ_ADD_MAC_FILTER.

If the synchronous operation fails locally (like the -EBUSY return above),
iavf_set_mac() calls this function. However, iavf_mac_add_reject() only
clears the pending state and removes the filter if f->is_new_mac is true:

    if (f->is_new_mac) {
        list_del(&f->list);
        kfree(f);
    }

Reused filters are ignored, leaving f->add = true and the
IAVF_FLAG_AQ_ADD_MAC_FILTER flag active. The background watchdog task will
eventually process this flag and blindly send the MAC configuration to the
PF, even though the VF already aborted the operation locally.
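
A minimal userspace model of the behaviour described above, assuming the removal logic is exactly the quoted is_new_mac check. The struct and function names are hypothetical, and list handling is reduced to a "live" flag.

```c
#include <stdbool.h>
#include <stddef.h>

struct filter_model {
	bool live;       /* still on the filter list */
	bool is_new_mac; /* created by this request rather than reused */
	bool add;        /* pending "add" still to be sent to the PF */
};

/*
 * Models the quoted part of iavf_mac_add_reject(): only filters with
 * is_new_mac set are dropped. Returns how many surviving filters still
 * have a pending add that a later background pass would send.
 */
static size_t model_mac_add_reject(struct filter_model *f, size_t n)
{
	size_t still_pending = 0;

	for (size_t i = 0; i < n; i++) {
		if (!f[i].live)
			continue;
		if (f[i].is_new_mac) {
			f[i].live = false; /* list_del() + kfree() */
			continue;
		}
		if (f[i].add)
			still_pending++; /* reused filter keeps add == true */
	}
	return still_pending;
}
```

In this model a reused filter (is_new_mac == false, add == true) survives the rollback with its pending add intact, which is the case the question above is about.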


* [PATCH net v3 11/11] rxrpc: Fix rxrpc_rotate_tx_window() to check there's something to rotate
From: David Howells @ 2026-06-24 16:38 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
	Jeffrey Altman, stable
In-Reply-To: <20260624163819.3017002-1-dhowells@redhat.com>

Fix rxrpc_rotate_tx_window() to check that there's something in the
transmission buffer to be rotated before it attempts to rotate anything.

Fixes: b341a0263b1b ("rxrpc: Implement progressive transmission queue struct")
Link: https://sashiko.dev/#/patchset/20260618134802.2477777-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
 net/rxrpc/input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 9bd0f1b92463..73cafe6bfa9f 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -237,6 +237,9 @@ static bool rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to,
 		call->acks_lowest_nak = to;
 	}
 
+	if (after(seq, to))
+		return false;
+
 	/* We may have a left over fully-consumed buffer at the front that we
 	 * couldn't drop before (rotate_and_keep below).
 	 */
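
For readers unfamiliar with the comparison used in the added check, the following is a minimal sketch of a wrap-safe "after" test on 32-bit sequence numbers. It assumes the usual signed-difference idiom; seq_t, seq_after() and nothing_to_rotate() are illustrative names, not the rxrpc definitions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t seq_t; /* illustrative stand-in for rxrpc_seq_t */

/*
 * True if a is later than b in modulo-2^32 sequence space.
 * The unsigned subtraction wraps, and the signed cast turns "slightly
 * behind" into a negative number.
 */
static bool seq_after(seq_t a, seq_t b)
{
	return (int32_t)(a - b) > 0;
}

/*
 * Shape of the early exit in the patch: if the first candidate sequence
 * number is already past the rotation target, there is nothing to do.
 */
static bool nothing_to_rotate(seq_t seq, seq_t to)
{
	return seq_after(seq, to);
}
```

A plain "seq > to" would give the wrong answer once the counter wraps, which is why the comparison is done on the signed difference.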



* [PATCH net v3 09/11] rxrpc: Fix socket notification race
From: David Howells @ 2026-06-24 16:38 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
	Jeffrey Altman, stable
In-Reply-To: <20260624163819.3017002-1-dhowells@redhat.com>

There's a race between rxrpc_recvmsg() and rxrpc_notify_socket(), whereby
the latter's attempt to avoid disabling interrupts and taking the socket's
recvmsg_lock if the call is already queued may happen simultaneously with
the former's discarding of a call that has nothing queued.

Fix this by removing the shortcut.  Note that this only affects userspace's
use of AF_RXRPC; the AFS filesystem driver doesn't use the socket queue.

Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
 net/rxrpc/recvmsg.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index f382a47c6eb0..9962e135cb73 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -27,8 +27,6 @@ void rxrpc_notify_socket(struct rxrpc_call *call)
 
 	_enter("%d", call->debug_id);
 
-	if (!list_empty(&call->recvmsg_link))
-		return;
 	if (test_bit(RXRPC_CALL_RELEASED, &call->flags)) {
 		rxrpc_see_call(call, rxrpc_call_see_notify_released);
 		return;



* [PATCH net v3 10/11] rxrpc: Fix leak of released call in recvmsg(MSG_PEEK)
From: David Howells @ 2026-06-24 16:38 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
	Jeffrey Altman, stable
In-Reply-To: <20260624163819.3017002-1-dhowells@redhat.com>

Fix rxrpc_recvmsg() to also drop the ref it holds on an already-released
call if MSG_PEEK is in force (the function holds a ref on the call
irrespective of whether MSG_PEEK is specified or not).

Fixes: 962fb1f651c2 ("rxrpc: Fix recv-recv race of completed call")
Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
 net/rxrpc/recvmsg.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 9962e135cb73..efcba4b2e74f 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -529,8 +529,7 @@ int rxrpc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 	if (test_bit(RXRPC_CALL_RELEASED, &call->flags)) {
 		rxrpc_see_call(call, rxrpc_call_see_already_released);
 		mutex_unlock(&call->user_mutex);
-		if (!(flags & MSG_PEEK))
-			rxrpc_put_call(call, rxrpc_call_put_recvmsg);
+		rxrpc_put_call(call, rxrpc_call_put_recvmsg);
 		goto try_again;
 	}
 



* [PATCH net v3 08/11] rxrpc: Fix potential infinite loop in rxrpc_recvmsg()
From: David Howells @ 2026-06-24 16:38 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
	Jeffrey Altman, stable
In-Reply-To: <20260624163819.3017002-1-dhowells@redhat.com>

Fix the wait in rxrpc_recvmsg() to also check the oob queue.

Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
 net/rxrpc/recvmsg.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 39a03684432d..f382a47c6eb0 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -438,7 +438,8 @@ int rxrpc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 		return -EAGAIN;
 	}
 
-	if (list_empty(&rx->recvmsg_q)) {
+	if (list_empty(&rx->recvmsg_q) &&
+	    skb_queue_empty_lockless(&rx->recvmsg_oobq)) {
 		ret = -EWOULDBLOCK;
 		if (timeo == 0) {
 			call = NULL;



