[PATCH net 0/4] net: avoid nested UP notifier events

* [PATCH net 0/4] net: avoid nested UP notifier events
@ 2026-06-24 18:20 Jakub Kicinski
  2026-06-24 18:20 ` [PATCH net 1/4] net: turn the rx_mode work into a generic netdev_work facility Jakub Kicinski
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, Jakub Kicinski

syzbot reported that the recent ethtool rework leads to a deadlock
on stacked devices. VLANs create nested notifications, confusing
the execution context. Bringing up dummy causes vlan to bring itself
up as well, which in turn causes bond to ask for link state -
a call chain traveling in the opposite direction.

  bond    (3) bond_update_speed_duplex(vlan)
    |           ^                v
  vlan    (2) UP(vlan)    (4) vlan_ethtool_get_link_ksettings()
    |           ^                v
  dummy   (1) UP(dummy)   (5) __ethtool_get_link_ksettings()

We locked the instance lock of dummy at (1) and will
try to lock it again at (5) - which of course deadlocks.

For non-nested notifications this is avoided because NETDEV_UP
is always run ops-locked (so that bond asks for link using the
netif_ API which assumes the instance lock is already held). The nesting,
however, makes this problematic: we cannot carry the state of
the whole chain back in the opposite direction.

AFAICT vlan is the only driver which causes such issues.
So let's try a localized fix of deferring vlan auto-open
to a workqueue.
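
The chain above can be modeled in userspace. What follows is only a
minimal sketch, not kernel code: toy_dev, toy_lock() and both call
chains are hypothetical stand-ins for the instance lock and the
notifier/ethtool paths, and the toy lock reports a would-be
self-deadlock instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a netdev with a non-recursive instance lock. */
struct toy_dev {
	const char *name;
	struct toy_dev *lower;	/* device we are stacked on, or NULL */
	bool locked;
};

/* Returns false where a real non-recursive lock would deadlock. */
static bool toy_lock(struct toy_dev *dev)
{
	if (dev->locked)
		return false;
	dev->locked = true;
	return true;
}

static void toy_unlock(struct toy_dev *dev)
{
	dev->locked = false;
}

/* Steps (4)-(5): the vlan forwards the query and locks the lower device. */
static bool toy_get_link(struct toy_dev *vlan)
{
	if (!toy_lock(vlan->lower))
		return false;
	toy_unlock(vlan->lower);
	return true;
}

/* Steps (1)-(3): UP(dummy) under dummy's lock, nested UP(vlan), bond query. */
static bool toy_up_nested(struct toy_dev *dummy, struct toy_dev *vlan)
{
	bool ok;

	if (!toy_lock(dummy))		/* (1) */
		return false;
	ok = toy_get_link(vlan);	/* (2) -> (3) -> (4) -> (5) */
	toy_unlock(dummy);
	return ok;
}

/* Deferred variant: the link query runs after dummy's lock is dropped. */
static bool toy_up_deferred(struct toy_dev *dummy, struct toy_dev *vlan)
{
	if (!toy_lock(dummy))
		return false;
	toy_unlock(dummy);
	return toy_get_link(vlan);
}
```

In this model the nested variant fails at step (5) because dummy is
still locked from step (1), while the deferred variant succeeds
because the query only runs once the lock has been released.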

Jakub Kicinski (4):
  net: turn the rx_mode work into a generic netdev_work facility
  net: add the driver-facing netdev_work scheduling API
  vlan: defer real device state propagation to netdev_work
  selftests: bonding: add a test for VLAN propagation over a bonded real
    device

 Documentation/networking/netdevices.rst       |   2 +
 net/core/Makefile                             |   2 +-
 .../selftests/drivers/net/bonding/Makefile    |   1 +
 include/linux/netdevice.h                     |  21 +-
 net/8021q/vlan.h                              |  11 ++
 net/core/dev.h                                |  11 +-
 net/8021q/vlan.c                              |  76 +-------
 net/8021q/vlan_dev.c                          |  60 ++++++
 net/core/dev.c                                |   2 +
 net/core/dev_addr_lists.c                     |  77 +-------
 net/core/netdev_work.c                        | 162 ++++++++++++++++
 .../drivers/net/bonding/bond_vlan_real_dev.sh | 180 ++++++++++++++++++
 12 files changed, 457 insertions(+), 148 deletions(-)
 create mode 100644 net/core/netdev_work.c
 create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh

-- 
2.54.0



* [PATCH net 1/4] net: turn the rx_mode work into a generic netdev_work facility
  2026-06-24 18:20 [PATCH net 0/4] net: avoid nested UP notifier events Jakub Kicinski
@ 2026-06-24 18:20 ` Jakub Kicinski
  2026-06-24 18:20 ` [PATCH net 2/4] net: add the driver-facing netdev_work scheduling API Jakub Kicinski
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, Jakub Kicinski

The rx_mode update runs from a workqueue: drivers have their
ndo_set_rx_mode_async() callback executed by a single global
work item under RTNL and ops lock. This is a useful pattern.

Support multiple "events" that need to be serviced and make RX_MODE
sync the first one. Call the events "core" because later on
we will let drivers define and schedule their own.
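
The pending-event bookkeeping can be illustrated with a small
userspace model. This is a sketch under stated assumptions: the
toy_* names and TOY_WORK_RX_MODE are hypothetical, and the spinlock,
the netdev reference tracking and the RTNL/ops locking of the real
code are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event bit, mirroring the role of NETDEV_WORK_RX_MODE. */
#define TOY_WORK_RX_MODE (1UL << 0)

struct toy_netdev {
	unsigned long core_pending;	/* events waiting to be serviced */
	bool queued;			/* on the global work list? */
	int rx_mode_runs;		/* how many times the handler ran */
};

/* Mark events pending and queue the device (queueing is idempotent). */
static void toy_work_sched(struct toy_netdev *dev, unsigned long events)
{
	dev->queued = true;
	dev->core_pending |= events;
}

/* Clear @mask; return the subset that was actually pending. */
static unsigned long toy_work_cancel(struct toy_netdev *dev, unsigned long mask)
{
	unsigned long hit = dev->core_pending & mask;

	dev->core_pending &= ~mask;
	if (!dev->core_pending)
		dev->queued = false;
	return hit;
}

/* What the single work item would do for one device. */
static void toy_work_proc(struct toy_netdev *dev)
{
	unsigned long core;

	if (!dev->queued)
		return;
	core = dev->core_pending;
	dev->core_pending = 0;
	dev->queued = false;
	if (core & TOY_WORK_RX_MODE)
		dev->rx_mode_runs++;
}
```

Scheduling the same event twice before the work runs results in a
single handler invocation, and a cancel returns the event only if it
was still pending, which is what lets a caller run it inline.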

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/core/Makefile         |   2 +-
 include/linux/netdevice.h |  10 ++--
 net/core/dev.h            |  11 +++-
 net/core/dev.c            |   1 +
 net/core/dev_addr_lists.c |  77 +------------------------
 net/core/netdev_work.c    | 117 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 138 insertions(+), 80 deletions(-)
 create mode 100644 net/core/netdev_work.c

diff --git a/net/core/Makefile b/net/core/Makefile
index dc17c5a61e9a..b3fdcb4e355f 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -13,7 +13,7 @@ obj-y		     += dev.o dev_api.o dev_addr_lists.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
 			sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
 			fib_notifier.o xdp.o flow_offload.o gro.o \
-			netdev-genl.o netdev-genl-gen.o gso.o
+			netdev-genl.o netdev-genl-gen.o netdev_work.o gso.o
 
 obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b67a12541eac..732506787db3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1930,8 +1930,9 @@ enum netdev_reg_state {
  *				has been enabled due to the need to listen to
  *				additional unicast addresses in a device that
  *				does not implement ndo_set_rx_mode()
- *	@rx_mode_node:		List entry for rx_mode work processing
- *	@rx_mode_tracker:	Refcount tracker for rx_mode work
+ *	@work_node:		List entry for async netdev_work processing
+ *	@work_tracker:		Refcount tracker for async netdev_work
+ *	@work_core_pending:	Core-defined pending netdev_work (NETDEV_WORK_*)
  *	@rx_mode_addr_cache:	Recycled snapshot entries for rx_mode work
  *	@rx_mode_retry_timer:	Timer that re-queues rx_mode work after failure
  *	@rx_mode_retry_count:	Number of consecutive retries already scheduled
@@ -2326,8 +2327,9 @@ struct net_device {
 	unsigned int		promiscuity;
 	unsigned int		allmulti;
 	bool			uc_promisc;
-	struct list_head	rx_mode_node;
-	netdevice_tracker	rx_mode_tracker;
+	struct list_head	work_node;
+	netdevice_tracker	work_tracker;
+	unsigned long		work_core_pending;
 	struct netdev_hw_addr_list	rx_mode_addr_cache;
 	struct timer_list	rx_mode_retry_timer;
 	unsigned int		rx_mode_retry_count;
diff --git a/net/core/dev.h b/net/core/dev.h
index 4121c50e7c88..5d0b0305d3ba 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -167,10 +167,19 @@ int dev_change_carrier(struct net_device *dev, bool new_carrier);
 void __dev_set_rx_mode(struct net_device *dev);
 int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify);
 void netif_rx_mode_init(struct net_device *dev);
-bool netif_rx_mode_clean(struct net_device *dev);
+void netif_rx_mode_run(struct net_device *dev);
 void netif_rx_mode_sync(struct net_device *dev);
 void netif_rx_mode_cancel_retry(struct net_device *dev);
 
+/* Events for the async netdev work, tracked in netdev->work_core_pending. */
+enum netdev_work_core {
+	NETDEV_WORK_RX_MODE	= BIT(0),	/* run the rx_mode update */
+};
+
+void __netdev_work_core_sched(struct net_device *dev, unsigned long event);
+unsigned long
+__netdev_work_core_cancel(struct net_device *dev, unsigned long mask);
+
 void __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
 			unsigned int gchanges, u32 portid,
 			const struct nlmsghdr *nlh);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5c01dfaa6c44..e1d8af0ef6ab 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -12093,6 +12093,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	INIT_LIST_HEAD(&dev->ptype_all);
 	INIT_LIST_HEAD(&dev->ptype_specific);
 	INIT_LIST_HEAD(&dev->net_notifier_list);
+	INIT_LIST_HEAD(&dev->work_node);
 #ifdef CONFIG_NET_SCHED
 	hash_init(dev->qdisc_hash);
 #endif
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index e17f64a65e17..08528ca0a8b3 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -12,17 +12,10 @@
 #include <linux/export.h>
 #include <linux/list.h>
 #include <linux/spinlock.h>
-#include <linux/workqueue.h>
 #include <kunit/visibility.h>
 
 #include "dev.h"
 
-static void netdev_rx_mode_work(struct work_struct *work);
-
-static LIST_HEAD(rx_mode_list);
-static DEFINE_SPINLOCK(rx_mode_lock);
-static DECLARE_WORK(rx_mode_work, netdev_rx_mode_work);
-
 /*
  * General list handling functions
  */
@@ -1281,7 +1274,7 @@ void netif_rx_mode_cancel_retry(struct net_device *dev)
 	dev->rx_mode_retry_count = 0;
 }
 
-static void netif_rx_mode_run(struct net_device *dev)
+void netif_rx_mode_run(struct net_device *dev)
 {
 	struct netdev_hw_addr_list uc_snap, mc_snap, uc_ref, mc_ref;
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -1339,49 +1332,9 @@ static void netif_rx_mode_run(struct net_device *dev)
 	}
 }
 
-static void netdev_rx_mode_work(struct work_struct *work)
-{
-	struct net_device *dev;
-
-	rtnl_lock();
-
-	while (true) {
-		spin_lock_bh(&rx_mode_lock);
-		if (list_empty(&rx_mode_list)) {
-			spin_unlock_bh(&rx_mode_lock);
-			break;
-		}
-		dev = list_first_entry(&rx_mode_list, struct net_device,
-				       rx_mode_node);
-		list_del_init(&dev->rx_mode_node);
-		/* We must free netdev tracker under
-		 * the spinlock protection.
-		 */
-		netdev_tracker_free(dev, &dev->rx_mode_tracker);
-		spin_unlock_bh(&rx_mode_lock);
-
-		netdev_lock_ops(dev);
-		netif_rx_mode_run(dev);
-		netdev_unlock_ops(dev);
-		/* Use __dev_put() because netdev_tracker_free() was already
-		 * called above. Must be after netdev_unlock_ops() to prevent
-		 * netdev_run_todo() from freeing the device while still in use.
-		 */
-		__dev_put(dev);
-	}
-
-	rtnl_unlock();
-}
-
 static void netif_rx_mode_queue(struct net_device *dev)
 {
-	spin_lock_bh(&rx_mode_lock);
-	if (list_empty(&dev->rx_mode_node)) {
-		list_add_tail(&dev->rx_mode_node, &rx_mode_list);
-		netdev_hold(dev, &dev->rx_mode_tracker, GFP_ATOMIC);
-	}
-	spin_unlock_bh(&rx_mode_lock);
-	schedule_work(&rx_mode_work);
+	__netdev_work_core_sched(dev, NETDEV_WORK_RX_MODE);
 }
 
 static void netif_rx_mode_retry(struct timer_list *t)
@@ -1394,7 +1347,6 @@ static void netif_rx_mode_retry(struct timer_list *t)
 
 void netif_rx_mode_init(struct net_device *dev)
 {
-	INIT_LIST_HEAD(&dev->rx_mode_node);
 	__hw_addr_init(&dev->rx_mode_addr_cache);
 	timer_setup(&dev->rx_mode_retry_timer, netif_rx_mode_retry, 0);
 }
@@ -1442,24 +1394,6 @@ void dev_set_rx_mode(struct net_device *dev)
 	netif_addr_unlock_bh(dev);
 }
 
-bool netif_rx_mode_clean(struct net_device *dev)
-{
-	bool clean = false;
-
-	spin_lock_bh(&rx_mode_lock);
-	if (!list_empty(&dev->rx_mode_node)) {
-		list_del_init(&dev->rx_mode_node);
-		clean = true;
-		/* We must release netdev tracker under
-		 * the spinlock protection.
-		 */
-		netdev_tracker_free(dev, &dev->rx_mode_tracker);
-	}
-	spin_unlock_bh(&rx_mode_lock);
-
-	return clean;
-}
-
 /**
  * netif_rx_mode_sync() - sync rx mode inline
  * @dev: network device
@@ -1473,11 +1407,6 @@ bool netif_rx_mode_clean(struct net_device *dev)
  */
 void netif_rx_mode_sync(struct net_device *dev)
 {
-	if (netif_rx_mode_clean(dev)) {
+	if (__netdev_work_core_cancel(dev, NETDEV_WORK_RX_MODE))
 		netif_rx_mode_run(dev);
-		/* Use __dev_put() because netdev_tracker_free() was already
-		 * called inside netif_rx_mode_clean().
-		 */
-		__dev_put(dev);
-	}
 }
diff --git a/net/core/netdev_work.c b/net/core/netdev_work.c
new file mode 100644
index 000000000000..c121c24dc493
--- /dev/null
+++ b/net/core/netdev_work.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+#include <net/netdev_lock.h>
+
+#include "dev.h"
+
+static void netdev_work_proc(struct work_struct *work);
+
+/* @netdev_work_lock protects:
+ *  - @netdev_work_list
+ *  - within the list entries (struct net_device fields):
+ *	- work_node
+ *	- work_tracker
+ *	- work_core_pending
+ */
+static LIST_HEAD(netdev_work_list);
+static DEFINE_SPINLOCK(netdev_work_lock);
+static DECLARE_WORK(netdev_work, netdev_work_proc);
+
+void __netdev_work_core_sched(struct net_device *dev, unsigned long event)
+{
+	spin_lock_bh(&netdev_work_lock);
+	if (list_empty(&dev->work_node)) {
+		list_add_tail(&dev->work_node, &netdev_work_list);
+		netdev_hold(dev, &dev->work_tracker, GFP_ATOMIC);
+	}
+	dev->work_core_pending |= event;
+	spin_unlock_bh(&netdev_work_lock);
+
+	schedule_work(&netdev_work);
+}
+
+/**
+ * __netdev_work_core_cancel() - cancel selected core work for a netdev
+ * @dev: net_device
+ * @mask: events to cancel
+ *
+ * Clear @mask from the device's work pending mask. If no work is left pending
+ * the device is dequeued.
+ *
+ * No expectations on locking, but also no guarantees provided. If the caller
+ * wants to touch @dev afterwards (e.g. call the work that got canceled)
+ * they have to ensure @dev does not get freed.
+ *
+ * Returns: the subset of @mask that was actually pending, so the caller can run
+ * those events inline.
+ */
+unsigned long
+__netdev_work_core_cancel(struct net_device *dev, unsigned long mask)
+{
+	unsigned long event;
+
+	spin_lock_bh(&netdev_work_lock);
+	event = dev->work_core_pending & mask;
+	dev->work_core_pending &= ~mask;
+	if (!list_empty(&dev->work_node) && !dev->work_core_pending) {
+		list_del_init(&dev->work_node);
+		netdev_put(dev, &dev->work_tracker);
+	}
+	spin_unlock_bh(&netdev_work_lock);
+
+	return event;
+}
+
+static void netdev_work_proc(struct work_struct *work)
+{
+	rtnl_lock();
+
+	while (true) {
+		netdevice_tracker tracker;
+		struct net_device *dev;
+		unsigned long core = 0;
+
+		spin_lock_bh(&netdev_work_lock);
+		if (list_empty(&netdev_work_list)) {
+			spin_unlock_bh(&netdev_work_lock);
+			break;
+		}
+		dev = list_first_entry(&netdev_work_list, struct net_device,
+				       work_node);
+		/* Take a temporary reference so @dev can't be freed while we
+		 * drop the lock to grab its ops lock; the work reference is
+		 * only released once we claim the work below.
+		 * The re-locking dance is to ensure that ops lock is enough
+		 * to ensure canceling work is not racy with dequeue.
+		 */
+		netdev_hold(dev, &tracker, GFP_ATOMIC);
+		spin_unlock_bh(&netdev_work_lock);
+
+		netdev_lock_ops(dev);
+		spin_lock_bh(&netdev_work_lock);
+		if (!list_empty(&dev->work_node)) {
+			list_del_init(&dev->work_node);
+			core = dev->work_core_pending;
+			dev->work_core_pending = 0;
+			/* We took another ref above */
+			netdev_put(dev, &dev->work_tracker);
+
+			if (!dev_isalive(dev))
+				core = 0;
+		}
+		spin_unlock_bh(&netdev_work_lock);
+
+		if (core & NETDEV_WORK_RX_MODE)
+			netif_rx_mode_run(dev);
+		netdev_unlock_ops(dev);
+
+		netdev_put(dev, &tracker);
+	}
+
+	rtnl_unlock();
+}
-- 
2.54.0



* [PATCH net 2/4] net: add the driver-facing netdev_work scheduling API
  2026-06-24 18:20 [PATCH net 0/4] net: avoid nested UP notifier events Jakub Kicinski
  2026-06-24 18:20 ` [PATCH net 1/4] net: turn the rx_mode work into a generic netdev_work facility Jakub Kicinski
@ 2026-06-24 18:20 ` Jakub Kicinski
  2026-06-24 18:20 ` [PATCH net 3/4] vlan: defer real device state propagation to netdev_work Jakub Kicinski
  2026-06-24 18:20 ` [PATCH net 4/4] selftests: bonding: add a test for VLAN propagation over a bonded real device Jakub Kicinski
  3 siblings, 0 replies; 5+ messages in thread
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, Jakub Kicinski

With an extra event mask we can easily extend the netdev work
to also service driver-defined events. For advanced drivers
this is probably not a perfect match, but it makes running
deferred work easier in simple cases.

Expose the netdev_work facility to drivers. Add helpers
to schedule work and a dedicated ndo to perform the
driver-scheduled actions.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/linux/netdevice.h | 11 ++++++
 net/core/netdev_work.c    | 81 ++++++++++++++++++++++++++++++---------
 2 files changed, 74 insertions(+), 18 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 732506787db3..9981d637f8b5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1131,6 +1131,9 @@ struct netdev_net_notifier {
  *	netdev_hw_addr_list_for_each(ha, uc). Return 0 on success or a
  *	negative errno to request a retry via the core backoff.
  *
+ * void (*ndo_work)(struct net_device *dev, unsigned long events);
+ *	Run deferred work scheduled with netdev_work_sched(@events).
+ *
  * int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
  *	This function  is called when the Media Access Control address
  *	needs to be changed. If this interface is not defined, the
@@ -1460,6 +1463,8 @@ struct net_device_ops {
 					struct net_device *dev,
 					struct netdev_hw_addr_list *uc,
 					struct netdev_hw_addr_list *mc);
+	void			(*ndo_work)(struct net_device *dev,
+					    unsigned long events);
 	int			(*ndo_set_mac_address)(struct net_device *dev,
 						       void *addr);
 	int			(*ndo_validate_addr)(struct net_device *dev);
@@ -1932,6 +1937,8 @@ enum netdev_reg_state {
  *				does not implement ndo_set_rx_mode()
  *	@work_node:		List entry for async netdev_work processing
  *	@work_tracker:		Refcount tracker for async netdev_work
+ *	@work_pending:		Driver-defined pending netdev_work, passed to
+ *				ndo_work() (see netdev_work_sched())
  *	@work_core_pending:	Core-defined pending netdev_work (NETDEV_WORK_*)
  *	@rx_mode_addr_cache:	Recycled snapshot entries for rx_mode work
  *	@rx_mode_retry_timer:	Timer that re-queues rx_mode work after failure
@@ -2329,6 +2336,7 @@ struct net_device {
 	bool			uc_promisc;
 	struct list_head	work_node;
 	netdevice_tracker	work_tracker;
+	unsigned long		work_pending;
 	unsigned long		work_core_pending;
 	struct netdev_hw_addr_list	rx_mode_addr_cache;
 	struct timer_list	rx_mode_retry_timer;
@@ -5178,6 +5186,9 @@ void dev_fetch_sw_netstats(struct rtnl_link_stats64 *s,
 			   const struct pcpu_sw_netstats __percpu *netstats);
 void dev_get_tstats64(struct net_device *dev, struct rtnl_link_stats64 *s);
 
+void netdev_work_sched(struct net_device *dev, unsigned long events);
+unsigned long netdev_work_cancel(struct net_device *dev, unsigned long mask);
+
 enum {
 	NESTED_SYNC_IMM_BIT,
 	NESTED_SYNC_TODO_BIT,
diff --git a/net/core/netdev_work.c b/net/core/netdev_work.c
index c121c24dc493..3109fae132ad 100644
--- a/net/core/netdev_work.c
+++ b/net/core/netdev_work.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 
+#include <linux/export.h>
 #include <linux/list.h>
 #include <linux/netdevice.h>
 #include <linux/rtnetlink.h>
@@ -16,32 +17,63 @@ static void netdev_work_proc(struct work_struct *work);
  *  - within the list entries (struct net_device fields):
  *	- work_node
  *	- work_tracker
+ *	- work_pending
  *	- work_core_pending
  */
 static LIST_HEAD(netdev_work_list);
 static DEFINE_SPINLOCK(netdev_work_lock);
 static DECLARE_WORK(netdev_work, netdev_work_proc);
 
-void __netdev_work_core_sched(struct net_device *dev, unsigned long event)
+static void netdev_work_enqueue(struct net_device *dev, unsigned long events,
+				unsigned long core)
 {
+	if (!events && !core)
+		return;
+
 	spin_lock_bh(&netdev_work_lock);
 	if (list_empty(&dev->work_node)) {
 		list_add_tail(&dev->work_node, &netdev_work_list);
 		netdev_hold(dev, &dev->work_tracker, GFP_ATOMIC);
 	}
-	dev->work_core_pending |= event;
+	dev->work_pending |= events;
+	dev->work_core_pending |= core;
 	spin_unlock_bh(&netdev_work_lock);
 
 	schedule_work(&netdev_work);
 }
 
+static unsigned long
+netdev_work_dequeue(struct net_device *dev, unsigned long *pending,
+		    unsigned long mask)
+{
+	unsigned long events;
+
+	spin_lock_bh(&netdev_work_lock);
+	events = *pending & mask;
+	*pending &= ~events;
+	if (!list_empty(&dev->work_node) &&
+	    !dev->work_pending && !dev->work_core_pending) {
+		list_del_init(&dev->work_node);
+		netdev_put(dev, &dev->work_tracker);
+	}
+	spin_unlock_bh(&netdev_work_lock);
+
+	return events;
+}
+
+void netdev_work_sched(struct net_device *dev, unsigned long events)
+{
+	netdev_work_enqueue(dev, events, 0);
+}
+EXPORT_SYMBOL(netdev_work_sched);
+
 /**
- * __netdev_work_core_cancel() - cancel selected core work for a netdev
+ * netdev_work_cancel() - cancel selected work for a netdev
  * @dev: net_device
  * @mask: events to cancel
  *
  * Clear @mask from the device's work pending mask. If no work is left pending
- * the device is dequeued.
+ * the device is dequeued and its ndo_work won't be called.
  *
  * No expectations on locking, but also no guarantees provided. If the caller
  * wants to touch @dev afterwards (e.g. call the work that got canceled)
@@ -50,21 +82,33 @@ void __netdev_work_core_sched(struct net_device *dev, unsigned long event)
  * Returns: the subset of @mask that was actually pending, so the caller can run
  * those events inline.
  */
+unsigned long netdev_work_cancel(struct net_device *dev, unsigned long mask)
+{
+	return netdev_work_dequeue(dev, &dev->work_pending, mask);
+}
+EXPORT_SYMBOL(netdev_work_cancel);
+
+void __netdev_work_core_sched(struct net_device *dev, unsigned long events)
+{
+	netdev_work_enqueue(dev, 0, events);
+}
+
 unsigned long
 __netdev_work_core_cancel(struct net_device *dev, unsigned long mask)
 {
-	unsigned long event;
+	return netdev_work_dequeue(dev, &dev->work_core_pending, mask);
+}
 
-	spin_lock_bh(&netdev_work_lock);
-	event = dev->work_core_pending & mask;
-	dev->work_core_pending &= ~mask;
-	if (!list_empty(&dev->work_node) && !dev->work_core_pending) {
-		list_del_init(&dev->work_node);
-		netdev_put(dev, &dev->work_tracker);
-	}
-	spin_unlock_bh(&netdev_work_lock);
+static void netdev_work_run(struct net_device *dev, unsigned long events,
+			    unsigned long core)
+{
+	if (!netif_device_present(dev))
+		return;
 
-	return event;
+	if (core & NETDEV_WORK_RX_MODE)
+		netif_rx_mode_run(dev);
+	if (events && dev->netdev_ops->ndo_work)
+		dev->netdev_ops->ndo_work(dev, events);
 }
 
 static void netdev_work_proc(struct work_struct *work)
@@ -72,9 +116,9 @@ static void netdev_work_proc(struct work_struct *work)
 	rtnl_lock();
 
 	while (true) {
+		unsigned long events = 0, core = 0;
 		netdevice_tracker tracker;
 		struct net_device *dev;
-		unsigned long core = 0;
 
 		spin_lock_bh(&netdev_work_lock);
 		if (list_empty(&netdev_work_list)) {
@@ -98,16 +142,17 @@ static void netdev_work_proc(struct work_struct *work)
 			list_del_init(&dev->work_node);
 			core = dev->work_core_pending;
 			dev->work_core_pending = 0;
+			events = dev->work_pending;
+			dev->work_pending = 0;
 			/* We took another ref above */
 			netdev_put(dev, &dev->work_tracker);
 
 			if (!dev_isalive(dev))
-				core = 0;
+				core = events = 0;
 		}
 		spin_unlock_bh(&netdev_work_lock);
 
-		if (core & NETDEV_WORK_RX_MODE)
-			netif_rx_mode_run(dev);
+		netdev_work_run(dev, events, core);
 		netdev_unlock_ops(dev);
 
 		netdev_put(dev, &tracker);
-- 
2.54.0



* [PATCH net 3/4] vlan: defer real device state propagation to netdev_work
  2026-06-24 18:20 [PATCH net 0/4] net: avoid nested UP notifier events Jakub Kicinski
  2026-06-24 18:20 ` [PATCH net 1/4] net: turn the rx_mode work into a generic netdev_work facility Jakub Kicinski
  2026-06-24 18:20 ` [PATCH net 2/4] net: add the driver-facing netdev_work scheduling API Jakub Kicinski
@ 2026-06-24 18:20 ` Jakub Kicinski
  2026-06-24 18:20 ` [PATCH net 4/4] selftests: bonding: add a test for VLAN propagation over a bonded real device Jakub Kicinski
  3 siblings, 0 replies; 5+ messages in thread
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, Jakub Kicinski,
	syzbot+09da62a8b78959ceb8bb, syzbot+cb67c392b0b8f0fd0fc1,
	syzbot+9bb8bd77f3966641f298

vlan_device_event() generates nested UP/DOWN, MTU and feature
change events. It executes an event for the VLAN device directly
from the notifier - while the locks of the lower device are held.

This causes deadlocks, for example:

  bond    (3) bond_update_speed_duplex(vlan)
    |           ^                v
  vlan    (2) UP(vlan)    (4) vlan_ethtool_get_link_ksettings()
    |           ^                v
  dummy   (1) UP(dummy)   (5) __ethtool_get_link_ksettings()

The dummy device is ops locked, vlan creates a nested event (2),
then bond wants to ask vlan for link state (3). bond uses the
"I'm already holding the instance lock" flavor of API. But in
this case the lock held refers to vlan itself. We hit vlan's
link settings trampoline (4) and call __ethtool_get_link_ksettings()
which tries to lock dummy. Deadlock. There's no clean way for us
to tell vlan_ethtool_get_link_ksettings() that the caller
is already in the lower device's critical section.

Defer the propagation to the per-netdev work facility instead:
the notifier only schedules netdev_work_sched(vlandev, VLAN_WORK_*),
and ndo_work (vlan_dev_work) applies the change later. Hopefully
nobody expects the VLAN state changes to be instantaneous.

If someone does expect the changes to be instantaneous we will
have to do the same thing Stan did for rx_mode and "strategically"
place sync calls, to make sure such delayed works are executed
after we drop the ops lock but before we drop rtnl_lock.

Stan suggests that if we need that down the line we may
consider reshaping the mechanism into "async notifications".
AFAICT only vlan does this sort of netdev open chaining,
so as a first try I think that sticking the complexity into
the vlan code makes sense.

One corner case is that we need to cancel the event if the user
explicitly changes the state before the work can run. Consider
the following operations with vlan0 on top of dummy0:

  ip link set dev dummy0 up    # queues work to up vlan0
  ip link set dev vlan0 down   # user explicitly downs the vlan
  ndo_work                     # acts on the stale event
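
The sequence above can be replayed against a small userspace model.
This is a hedged sketch only: toy_vlan and its helpers are
hypothetical, and the handler's "sync the vlan to the real device"
behavior is an assumption made for illustration, not a copy of
vlan_dev_work().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event bit, mirroring the role of VLAN_WORK_LINK_STATE. */
#define TOY_VLAN_WORK_LINK_STATE (1UL << 0)

struct toy_vlan {
	bool up;			/* state of the vlan device */
	bool real_up;			/* state of the real device */
	unsigned long pending;		/* deferred events not yet run */
};

/* Notifier side: the real device changed state, defer the sync. */
static void toy_real_dev_set(struct toy_vlan *v, bool up)
{
	v->real_up = up;
	v->pending |= TOY_VLAN_WORK_LINK_STATE;
}

/* Explicit user request; @cancel models the cancel in open/stop. */
static void toy_vlan_set(struct toy_vlan *v, bool up, bool cancel)
{
	if (cancel)
		v->pending &= ~TOY_VLAN_WORK_LINK_STATE;
	v->up = up;
}

/* Deferred handler (assumed behavior): sync vlan state to the real device. */
static void toy_vlan_work(struct toy_vlan *v)
{
	unsigned long events = v->pending;

	v->pending = 0;
	if (events & TOY_VLAN_WORK_LINK_STATE)
		v->up = v->real_up;
}
```

Without the cancel, the deferred handler brings the vlan back up
after the user has downed it. With the cancel, the explicit request
wins and the stale event is dropped.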

Reported-by: syzbot+09da62a8b78959ceb8bb@syzkaller.appspotmail.com
Reported-by: syzbot+cb67c392b0b8f0fd0fc1@syzkaller.appspotmail.com
Reported-by: syzbot+9bb8bd77f3966641f298@syzkaller.appspotmail.com
Fixes: 9f275c2e9020 ("net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/networking/netdevices.rst |  2 +
 net/8021q/vlan.h                        | 11 ++++
 net/8021q/vlan.c                        | 76 +++----------------------
 net/8021q/vlan_dev.c                    | 60 +++++++++++++++++++
 net/core/dev.c                          |  1 +
 5 files changed, 82 insertions(+), 68 deletions(-)

diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index fde601acd1d2..d2a238f8cc8b 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -433,6 +433,8 @@ exceptions) notifiers run under the instance lock. Please extend this
 documentation whenever you make explicit assumption about lock being held
 from a notifier.
 
+Drivers **must not** generate nested notifications of the ops-locked types.
+
 NETDEV_INTERNAL symbol namespace
 ================================
 
diff --git a/net/8021q/vlan.h b/net/8021q/vlan.h
index c7ffe591d593..c41caaf94095 100644
--- a/net/8021q/vlan.h
+++ b/net/8021q/vlan.h
@@ -125,6 +125,17 @@ static inline netdev_features_t vlan_tnl_features(struct net_device *real_dev)
 int vlan_filter_push_vids(struct vlan_info *vlan_info, __be16 proto);
 void vlan_filter_drop_vids(struct vlan_info *vlan_info, __be16 proto);
 
+/* netdev_work events propagated from the real device, see vlan_dev_work(). */
+enum {
+	VLAN_WORK_LINK_STATE	= BIT(0), /* sync up/down with real_dev */
+	VLAN_WORK_MTU		= BIT(1), /* clamp mtu to real_dev's */
+	VLAN_WORK_FEATURES	= BIT(2), /* re-inherit real_dev features */
+};
+
+void vlan_stacked_transfer_operstate(const struct net_device *rootdev,
+				     struct net_device *dev,
+				     struct vlan_dev_priv *vlan);
+
 /* found in vlan_dev.c */
 void vlan_dev_set_ingress_priority(const struct net_device *dev,
 				   u32 skb_prio, u16 vlan_prio);
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 2b74ed56eb16..2d2efb877975 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -77,9 +77,9 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 	return 0;
 }
 
-static void vlan_stacked_transfer_operstate(const struct net_device *rootdev,
-					    struct net_device *dev,
-					    struct vlan_dev_priv *vlan)
+void vlan_stacked_transfer_operstate(const struct net_device *rootdev,
+				     struct net_device *dev,
+				     struct vlan_dev_priv *vlan)
 {
 	if (!(vlan->flags & VLAN_FLAG_BRIDGE_BINDING))
 		netif_stacked_transfer_operstate(rootdev, dev);
@@ -316,29 +316,6 @@ static void vlan_sync_address(struct net_device *dev,
 	ether_addr_copy(vlan->real_dev_addr, dev->dev_addr);
 }
 
-static void vlan_transfer_features(struct net_device *dev,
-				   struct net_device *vlandev)
-{
-	struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev);
-
-	netif_inherit_tso_max(vlandev, dev);
-
-	if (vlan_hw_offload_capable(dev->features, vlan->vlan_proto))
-		vlandev->hard_header_len = dev->hard_header_len;
-	else
-		vlandev->hard_header_len = dev->hard_header_len + VLAN_HLEN;
-
-#if IS_ENABLED(CONFIG_FCOE)
-	vlandev->fcoe_ddp_xid = dev->fcoe_ddp_xid;
-#endif
-
-	vlandev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
-	vlandev->priv_flags |= (vlan->real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
-	vlandev->hw_enc_features = vlan_tnl_features(vlan->real_dev);
-
-	netdev_update_features(vlandev);
-}
-
 static int __vlan_device_event(struct net_device *dev, unsigned long event)
 {
 	int err = 0;
@@ -391,13 +368,11 @@ static void vlan_vid0_del(struct net_device *dev)
 static int vlan_device_event(struct notifier_block *unused, unsigned long event,
 			     void *ptr)
 {
-	struct netlink_ext_ack *extack = netdev_notifier_info_to_extack(ptr);
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct vlan_group *grp;
 	struct vlan_info *vlan_info;
 	int i, flgs;
 	struct net_device *vlandev;
-	struct vlan_dev_priv *vlan;
 	bool last = false;
 	LIST_HEAD(list);
 	int err;
@@ -447,54 +422,19 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event,
 			if (vlandev->mtu <= dev->mtu)
 				continue;
 
-			dev_set_mtu(vlandev, dev->mtu);
+			netdev_work_sched(vlandev, VLAN_WORK_MTU);
 		}
 		break;
 
 	case NETDEV_FEAT_CHANGE:
-		/* Propagate device features to underlying device */
 		vlan_group_for_each_dev(grp, i, vlandev)
-			vlan_transfer_features(dev, vlandev);
+			netdev_work_sched(vlandev, VLAN_WORK_FEATURES);
 		break;
 
-	case NETDEV_DOWN: {
-		struct net_device *tmp;
-		LIST_HEAD(close_list);
-
-		/* Put all VLANs for this dev in the down state too.  */
-		vlan_group_for_each_dev(grp, i, vlandev) {
-			flgs = vlandev->flags;
-			if (!(flgs & IFF_UP))
-				continue;
-
-			vlan = vlan_dev_priv(vlandev);
-			if (!(vlan->flags & VLAN_FLAG_LOOSE_BINDING))
-				list_add(&vlandev->close_list, &close_list);
-		}
-
-		netif_close_many(&close_list, false);
-
-		list_for_each_entry_safe(vlandev, tmp, &close_list, close_list) {
-			vlan_stacked_transfer_operstate(dev, vlandev,
-							vlan_dev_priv(vlandev));
-			list_del_init(&vlandev->close_list);
-		}
-		list_del(&close_list);
-		break;
-	}
+	case NETDEV_DOWN:
 	case NETDEV_UP:
-		/* Put all VLANs for this dev in the up state too.  */
-		vlan_group_for_each_dev(grp, i, vlandev) {
-			flgs = netif_get_flags(vlandev);
-			if (flgs & IFF_UP)
-				continue;
-
-			vlan = vlan_dev_priv(vlandev);
-			if (!(vlan->flags & VLAN_FLAG_LOOSE_BINDING))
-				dev_change_flags(vlandev, flgs | IFF_UP,
-						 extack);
-			vlan_stacked_transfer_operstate(dev, vlandev, vlan);
-		}
+		vlan_group_for_each_dev(grp, i, vlandev)
+			netdev_work_sched(vlandev, VLAN_WORK_LINK_STATE);
 		break;
 
 	case NETDEV_UNREGISTER:
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 7aa3af8b10ea..ec2569b3f8da 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -270,6 +270,9 @@ static int vlan_dev_open(struct net_device *dev)
 	    !(vlan->flags & VLAN_FLAG_LOOSE_BINDING))
 		return -ENETDOWN;
 
+	/* The explicit open supersedes any deferred link-state sync */
+	netdev_work_cancel(dev, VLAN_WORK_LINK_STATE);
+
 	if (!ether_addr_equal(dev->dev_addr, real_dev->dev_addr) &&
 	    !vlan_dev_inherit_address(dev, real_dev)) {
 		err = dev_uc_add(real_dev, dev->dev_addr);
@@ -300,6 +303,9 @@ static int vlan_dev_stop(struct net_device *dev)
 	struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
 	struct net_device *real_dev = vlan->real_dev;
 
+	/* The explicit close supersedes any deferred link-state sync */
+	netdev_work_cancel(dev, VLAN_WORK_LINK_STATE);
+
 	dev_mc_unsync(real_dev, dev);
 	dev_uc_unsync(real_dev, dev);
 
@@ -1016,6 +1022,59 @@ static const struct ethtool_ops vlan_ethtool_ops = {
 	.get_ts_info		= vlan_ethtool_get_ts_info,
 };
 
+static void vlan_transfer_features(struct net_device *dev,
+				   struct net_device *vlandev)
+{
+	struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev);
+
+	netif_inherit_tso_max(vlandev, dev);
+
+	if (vlan_hw_offload_capable(dev->features, vlan->vlan_proto))
+		vlandev->hard_header_len = dev->hard_header_len;
+	else
+		vlandev->hard_header_len = dev->hard_header_len + VLAN_HLEN;
+
+#if IS_ENABLED(CONFIG_FCOE)
+	vlandev->fcoe_ddp_xid = dev->fcoe_ddp_xid;
+#endif
+
+	vlandev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+	vlandev->priv_flags |= (vlan->real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
+	vlandev->hw_enc_features = vlan_tnl_features(vlan->real_dev);
+
+	netdev_update_features(vlandev);
+}
+
+static void vlan_dev_work(struct net_device *vlandev, unsigned long events)
+{
+	struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev);
+	struct net_device *real_dev = vlan->real_dev;
+	bool loose = vlan->flags & VLAN_FLAG_LOOSE_BINDING;
+	unsigned int flgs;
+
+	if (events & VLAN_WORK_LINK_STATE) {
+		flgs = netif_get_flags(vlandev);
+		if (real_dev->flags & IFF_UP) {
+			if (!(flgs & IFF_UP)) {
+				if (!loose)
+					netif_change_flags(vlandev,
+							   flgs | IFF_UP, NULL);
+				vlan_stacked_transfer_operstate(real_dev,
+								vlandev, vlan);
+			}
+		} else if ((flgs & IFF_UP) && !loose) {
+			netif_change_flags(vlandev, flgs & ~IFF_UP, NULL);
+			vlan_stacked_transfer_operstate(real_dev, vlandev, vlan);
+		}
+	}
+
+	if ((events & VLAN_WORK_MTU) && vlandev->mtu > real_dev->mtu)
+		netif_set_mtu(vlandev, real_dev->mtu);
+
+	if (events & VLAN_WORK_FEATURES)
+		vlan_transfer_features(real_dev, vlandev);
+}
+
 static const struct net_device_ops vlan_netdev_ops = {
 	.ndo_change_mtu		= vlan_dev_change_mtu,
 	.ndo_init		= vlan_dev_init,
@@ -1027,6 +1086,7 @@ static const struct net_device_ops vlan_netdev_ops = {
 	.ndo_set_mac_address	= vlan_dev_set_mac_address,
 	.ndo_set_rx_mode	= vlan_dev_set_rx_mode,
 	.ndo_change_rx_flags	= vlan_dev_change_rx_flags,
+	.ndo_work		= vlan_dev_work,
 	.ndo_eth_ioctl		= vlan_dev_ioctl,
 	.ndo_neigh_setup	= vlan_dev_neigh_setup,
 	.ndo_get_stats64	= vlan_dev_get_stats64,
diff --git a/net/core/dev.c b/net/core/dev.c
index e1d8af0ef6ab..4b3d5cfdf6e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9822,6 +9822,7 @@ int netif_change_flags(struct net_device *dev, unsigned int flags,
 	__dev_notify_flags(dev, old_flags, changes, 0, NULL);
 	return ret;
 }
+EXPORT_SYMBOL(netif_change_flags);
 
 int __netif_set_mtu(struct net_device *dev, int new_mtu)
 {
-- 
2.54.0



* [PATCH net 4/4] selftests: bonding: add a test for VLAN propagation over a bonded real device
  2026-06-24 18:20 [PATCH net 0/4] net: avoid nested UP notifier events Jakub Kicinski
                   ` (2 preceding siblings ...)
  2026-06-24 18:20 ` [PATCH net 3/4] vlan: defer real device state propagation to netdev_work Jakub Kicinski
@ 2026-06-24 18:20 ` Jakub Kicinski
  3 siblings, 0 replies; 5+ messages in thread
From: Jakub Kicinski @ 2026-06-24 18:20 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, Jakub Kicinski

Add a regression test for the VLAN notifier handling that the netdev_work
deferral fixed.

A VLAN's real device propagates its UP/DOWN, MTU and feature changes onto
the VLANs stacked on top of it. This used to be done synchronously from the
real device's notifier and deadlocked when the real device was brought up
while enslaved to a bond (instance lock held across NETDEV_UP) and the VLAN
on top was itself a bond member: the synchronous propagation re-entered the
stack and took the same instance lock again.

The test covers both halves:
 - that the deferred UP/DOWN, MTU and feature propagation actually lands on
   the VLAN (link state and MTU use an ops-locked dummy, i.e. the deferral
   path; features use veth, which exports vlan_features to inherit), and
 - that the deadlock-prone topology - a VLAN on a dummy, with the VLAN and
   the dummy each enslaved to a different bond - can be built without
   hanging.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
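The deferral makes the propagation asynchronous, which is why the test
polls with busywait instead of checking state once. A minimal standalone
sketch of that pattern follows; it runs without root or network
namespaces. poll_until and file_has_flag are hypothetical stand-ins for
the selftest library's busywait and for the script's link_has_flag, the
flag strings are made-up sample data rather than captured ip output, and
GNU date and sleep are assumed.

```shell
#!/bin/bash
# Hypothetical stand-in for busywait: retry a predicate until it succeeds
# or the timeout (in milliseconds) expires. Assumes GNU date (%N).
poll_until()
{
	local timeout_ms=$1; shift
	local deadline=$(( $(date +%s%N) / 1000000 + timeout_ms ))

	while ! "$@"; do
		[ $(( $(date +%s%N) / 1000000 )) -ge "$deadline" ] && return 1
		sleep 0.05
	done
	return 0
}

# Same matching idea as link_has_flag, applied to a sample file: the flag
# must be delimited by '<' or ',' on the left and ',' or '>' on the right.
file_has_flag()
{
	grep -q "[<,]${2}[,>]" "$1"
}

state=$(mktemp)
echo "<BROADCAST,MULTICAST>" > "$state"

# Model the deferred work: the flag appears some time after the trigger.
( sleep 0.2; echo "<BROADCAST,MULTICAST,UP,LOWER_UP>" > "$state" ) &

# A single check right after the trigger is too early.
file_has_flag "$state" UP; immediate=$?

# Polling observes the state once the deferred update has landed.
poll_until 2000 file_has_flag "$state" UP; polled=$?
wait

# The delimiters keep a partial name from matching inside LOWER_UP.
file_has_flag "$state" LOWER; partial=$?

rm -f "$state"
echo "immediate=$immediate polled=$polled partial=$partial"
```

The one-shot check fails while the polled check succeeds, which is the
behaviour the busywait calls in the test rely on. The delimiter classes
also matter in the other direction: a bare "UP" pattern would match a
device that only reports LOWER_UP.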
 .../selftests/drivers/net/bonding/Makefile    |   1 +
 .../drivers/net/bonding/bond_vlan_real_dev.sh | 180 ++++++++++++++++++
 2 files changed, 181 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh

diff --git a/tools/testing/selftests/drivers/net/bonding/Makefile b/tools/testing/selftests/drivers/net/bonding/Makefile
index be130bf585a4..6364ca02642d 100644
--- a/tools/testing/selftests/drivers/net/bonding/Makefile
+++ b/tools/testing/selftests/drivers/net/bonding/Makefile
@@ -13,6 +13,7 @@ TEST_PROGS := \
 	bond_options.sh \
 	bond_passive_lacp.sh \
 	bond_stacked_header_parse.sh \
+	bond_vlan_real_dev.sh \
 	dev_addr_lists.sh \
 	mode-1-recovery-updelay.sh \
 	mode-2-recovery-updelay.sh \
diff --git a/tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh b/tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
new file mode 100755
index 000000000000..542d9ffc4819
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
@@ -0,0 +1,180 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test propagation of a real device's state to the VLANs stacked on top of it
+# when the real device is (or becomes) a bond member.
+#
+# The kernel mirrors a real device's UP/DOWN, MTU and feature changes onto its
+# VLANs.  This is done asynchronously (netdev_work): doing it synchronously from
+# the real device's notifier could deadlock.  If the real device is brought up
+# while enslaved to a bond - so its instance lock is held across NETDEV_UP - and
+# a VLAN on top of it is itself a bond member, the synchronous propagation
+# would re-enter the stack and try to take the same instance lock again.
+#
+# Cover both halves:
+#  - the deferred UP/DOWN, MTU and feature propagation actually lands on the
+#    VLAN (link state and MTU use an ops-locked dummy, i.e. the deferral path),
+#  - the deadlock-prone topology - a VLAN on a dummy, with the VLAN and the
+#    dummy each enslaved to a different bond - can be built without hanging.
+
+ALL_TESTS="
+	vlan_link_state
+	vlan_mtu
+	vlan_features
+	vlan_real_dev_enslave
+"
+
+REQUIRE_MZ=no
+NUM_NETIFS=0
+lib_dir=$(dirname "$0")
+source "$lib_dir"/../../../net/forwarding/lib.sh
+
+# Return 0 if $dev in netns $ns has flag $flag set (e.g. UP) in its <...> flags.
+link_has_flag()
+{
+	local ns=$1 dev=$2 flag=$3
+
+	ip -n "$ns" link show dev "$dev" 2>/dev/null | grep -q "[<,]${flag}[,>]"
+}
+
+link_lacks_flag()
+{
+	! link_has_flag "$@"
+}
+
+link_mtu_is()
+{
+	local ns=$1 dev=$2 want=$3 cur
+
+	cur=$(ip -n "$ns" link show dev "$dev" 2>/dev/null | \
+		sed -n 's/.* mtu \([0-9]\+\).*/\1/p')
+	[ "$cur" = "$want" ]
+}
+
+vlan_feature_is()
+{
+	local ns=$1 dev=$2 feature=$3 value=$4
+
+	ip netns exec "$ns" ethtool -k "$dev" 2>/dev/null | \
+		grep -q "^$feature: $value"
+}
+
+link_has_master()
+{
+	local ns=$1 dev=$2 master=$3
+
+	ip -n "$ns" -o link show dev "$dev" 2>/dev/null | grep -q "master $master"
+}
+
+vlan_link_state()
+{
+	RET=0
+
+	ip -n "$NS" link add ls_dummy type dummy
+	ip -n "$NS" link add link ls_dummy name ls_vlan type vlan id 100
+
+	# Bringing the real device up must propagate UP to the VLAN.
+	ip -n "$NS" link set ls_dummy up
+	busywait "$BUSYWAIT_TIMEOUT" link_has_flag "$NS" ls_vlan UP
+	check_err $? "VLAN did not go UP after the real device went UP"
+
+	# ... and likewise for DOWN.
+	ip -n "$NS" link set ls_dummy down
+	busywait "$BUSYWAIT_TIMEOUT" link_lacks_flag "$NS" ls_vlan UP
+	check_err $? "VLAN did not go DOWN after the real device went DOWN"
+
+	ip -n "$NS" link del ls_vlan
+	ip -n "$NS" link del ls_dummy
+
+	log_test "VLAN link state follows the real device"
+}
+
+vlan_mtu()
+{
+	RET=0
+
+	# The VLAN inherits the real device's MTU (2000) at creation time.
+	ip -n "$NS" link add mtu_dummy mtu 2000 type dummy
+	ip -n "$NS" link add link mtu_dummy name mtu_vlan type vlan id 100
+
+	# Shrinking the real device's MTU must clamp the VLAN's MTU.
+	ip -n "$NS" link set mtu_dummy mtu 1500
+	busywait "$BUSYWAIT_TIMEOUT" link_mtu_is "$NS" mtu_vlan 1500
+	check_err $? "VLAN MTU not clamped after the real device's MTU shrank"
+
+	ip -n "$NS" link del mtu_vlan
+	ip -n "$NS" link del mtu_dummy
+
+	log_test "VLAN MTU clamped to the real device"
+}
+
+vlan_features()
+{
+	RET=0
+
+	# Use veth as the real device: unlike dummy it exports vlan_features, so
+	# the VLAN actually inherits a toggleable offload to assert on.
+	ip -n "$NS" link add ft_veth type veth peer name ft_veth_pr
+	ip -n "$NS" link add link ft_veth name ft_vlan type vlan id 100
+
+	vlan_feature_is "$NS" ft_vlan scatter-gather on
+	check_err $? "VLAN did not inherit scatter-gather from the real device"
+
+	# Toggling the offload on the real device must propagate to the VLAN.
+	ip netns exec "$NS" ethtool -K ft_veth sg off
+	busywait "$BUSYWAIT_TIMEOUT" \
+		vlan_feature_is "$NS" ft_vlan scatter-gather off
+	check_err $? "VLAN scatter-gather still on after disabling it on real dev"
+
+	ip netns exec "$NS" ethtool -K ft_veth sg on
+	busywait "$BUSYWAIT_TIMEOUT" \
+		vlan_feature_is "$NS" ft_vlan scatter-gather on
+	check_err $? "VLAN scatter-gather still off after enabling it on real dev"
+
+	ip -n "$NS" link del ft_vlan
+	ip -n "$NS" link del ft_veth
+
+	log_test "VLAN features follow the real device"
+}
+
+vlan_real_dev_enslave()
+{
+	RET=0
+
+	# dummy <- VLAN -> bond0, then enslave the dummy itself to bond1.  The
+	# last step brings the dummy up with its instance lock held, which used
+	# to deadlock while synchronously propagating UP to the (bond-enslaved)
+	# VLAN on top.
+	ip -n "$NS" link add dl_dummy type dummy
+	ip -n "$NS" link set dl_dummy up
+	ip -n "$NS" link add link dl_dummy name dl_vlan type vlan id 100
+
+	ip -n "$NS" link add dl_bond0 type bond mode active-backup
+	ip -n "$NS" link set dl_vlan down
+	ip -n "$NS" link set dl_vlan master dl_bond0
+	check_err $? "could not enslave the VLAN to bond0"
+
+	ip -n "$NS" link add dl_bond1 type bond mode active-backup
+	ip -n "$NS" link set dl_dummy down
+	ip -n "$NS" link set dl_dummy master dl_bond1
+	check_err $? "could not enslave the real device to bond1"
+
+	# If we got here the kernel did not deadlock; make sure it is still
+	# responsive and the enslave really took effect.
+	link_has_master "$NS" dl_dummy dl_bond1
+	check_err $? "real device not enslaved to bond1"
+
+	ip -n "$NS" link del dl_bond1
+	ip -n "$NS" link del dl_bond0
+	ip -n "$NS" link del dl_vlan
+	ip -n "$NS" link del dl_dummy
+
+	log_test "VLAN real device enslaved to a second bond"
+}
+
+setup_ns NS
+trap 'cleanup_ns $NS' EXIT
+
+tests_run
+
+exit "$EXIT_STATUS"
-- 
2.54.0

