netdev.vger.kernel.org archive mirror
* [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot
@ 2025-11-20 14:13 I Viswanath
  2025-11-20 14:13 ` [PATCH net-next v5 1/2] net: refactor set_rx_mode into snapshot and deferred I/O I Viswanath
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: I Viswanath @ 2025-11-20 14:13 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, horms, sdf, kuniyu,
	skhawaja, aleksander.lobakin, mst, jasowang, xuanzhuo, eperezma
  Cc: virtualization, netdev, linux-kernel, linux-kernel-mentees,
	I Viswanath

This is an implementation of the idea provided by Jakub here

https://lore.kernel.org/netdev/20250923163727.5e97abdb@kernel.org/

ndo_set_rx_mode is problematic because it cannot sleep. 

To address this, this series proposes splitting rx_mode configuration
into 2 stages: snapshot and deferred I/O. To achieve this, we
reinterpret set_rx_mode and add a new ndo, write_rx_mode, as
explained below:

The new set_rx_mode will be responsible for customizing the rx_mode
snapshot, which write_rx_mode will then use to update the hardware.

In brief, the new flow looks something like:

prepare_rx_mode():
    ndo_set_rx_mode();
    ready_snapshot();

write_rx_mode():
    commit_and_use_snapshot();
    ndo_write_rx_mode();

write_rx_mode() is called from a work item and does not hold
netif_addr_lock during ndo_write_rx_mode(), making that section
sleepable.

This model should work correctly if the following conditions hold:

1. write_rx_mode should use the rx_mode snapshot produced by the most
    recent call to prepare_rx_mode before its execution.

2. If a prepare_rx_mode call happens during execution of write_rx_mode,
    write_rx_mode should be rescheduled.

3. All calls to modify rx_mode should pass through the prepare_rx_mode +
    schedule write_rx_mode execution flow. netif_rx_mode_schedule_work 
    has been implemented in core for this.

Conditions 1 and 2 are guaranteed by the properties of workqueues;
drivers need to ensure condition 3.

To use this model, a driver needs to implement the
ndo_write_rx_mode callback, adapt its set_rx_mode callback
appropriately, and replace all calls that modify rx mode with
netif_rx_mode_schedule_work.

Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com>
---
Questions I have:

1) Would there ever be a situation in which a call to set_rx_mode has
to wait for the I/O to complete before proceeding further? That is,
does netif_rx_mode_schedule_work need the flush argument?

2) Does priv_ptr in netif_rx_mode_config make sense? For virtio_net, I
can get the vi pointer with netdev_priv(dev) and am wondering
whether that would be the common case.

3) From a previous discussion: 
https://lore.kernel.org/netdev/417c677f-268a-4163-b07e-deea8f9b9b40@intel.com/

On Thu, 23 Oct 2025 at 05:16, Jacob Keller  wrote:
> Is there any mechanism to make this guarantee either implemented or at
> least verified by the core? If not that, what about some sort of way to
> lint driver code and make sure its correct?

I am not sure how to automate this, but basically we need warnings to be
generated when a set_rx_mode implementation is called directly in
code (from my understanding, usually in the open callback or the timeout
function) but not when it is called through ops->set_rx_mode.
Can Coccinelle do something like this?

v1:
Link: https://lore.kernel.org/netdev/20251020134857.5820-1-viswanathiyyappan@gmail.com/

v2:
- Exported set_and_schedule_rx_config as a symbol for use in modules
- Fixed incorrect cleanup for the case of rx_work alloc failing in alloc_netdev_mqs
- Removed the locked version (cp_set_rx_mode) and renamed __cp_set_rx_mode to cp_set_rx_mode
Link: https://lore.kernel.org/netdev/20251026175445.1519537-1-viswanathiyyappan@gmail.com/

v3:
- Added RFT tag
- Corrected mangled patch
Link: https://lore.kernel.org/netdev/20251028174222.1739954-1-viswanathiyyappan@gmail.com/

v4:
- Completely reworked the snapshot mechanism as per v3 comments
- Implemented the callback for virtio-net instead of 8139cp driver
- Removed RFC tag
Link: https://lore.kernel.org/netdev/20251118164333.24842-1-viswanathiyyappan@gmail.com/

v5:
- Fix broken code and titles
- Remove RFT tag

Here is an enumeration of the cases I have tested, which should be exhaustive:

RX behaviour verification:

1) Dest is UC/MC addr X in UC/MC list:
	no mode: Recv
	allmulti: Recv
	promisc: Recv	

2) Dest is UC addr X not in UC list:
	no_mode: Drop
	allmulti: Drop
	promisc: Recv

3) Dest is MC addr X not in MC list:
	no_mode: Drop
	allmulti: Recv
	promisc: Recv

Packets injected from host using scapy on a TAP device as follows:
sendp(Ether(src=tap0_mac, dst=X) / IP() / UDP() / "test", iface="tap0")

And on the VM side, rx was checked via cat /proc/net/dev

Teardown path:

These are relevant because they flush the work item. ens4 is the virtio-net interface.

virtnet_remove:
ip maddr add 01:00:5e:00:03:02 dev ens4; echo 1 > /sys/bus/pci/devices/0000:00:04.0/remove

virtnet_freeze_down:
ip maddr add 01:00:5e:00:03:02 dev ens4; echo mem > /sys/power/state

---

I Viswanath (2):
  net: refactor set_rx_mode into snapshot and deferred I/O
  virtio-net: Implement ndo_write_rx_mode callback

 drivers/net/virtio_net.c  |  58 +++++------
 include/linux/netdevice.h | 104 ++++++++++++++++++-
 net/core/dev.c            | 208 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 330 insertions(+), 40 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH net-next v5 1/2] net: refactor set_rx_mode into snapshot and deferred I/O
  2025-11-20 14:13 [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot I Viswanath
@ 2025-11-20 14:13 ` I Viswanath
  2025-11-20 14:13 ` [PATCH net-next v5 2/2] virtio-net: Implement ndo_write_rx_mode callback I Viswanath
  2025-11-20 15:17 ` [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot Jakub Kicinski
  2 siblings, 0 replies; 5+ messages in thread
From: I Viswanath @ 2025-11-20 14:13 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, horms, sdf, kuniyu,
	skhawaja, aleksander.lobakin, mst, jasowang, xuanzhuo, eperezma
  Cc: virtualization, netdev, linux-kernel, linux-kernel-mentees,
	I Viswanath

ndo_set_rx_mode is problematic as it cannot sleep.

There are drivers that circumvent this by doing the rx_mode work
in a work item. This requires boilerplate that could be avoided if
core provided a mechanism for it. This patch proposes such a
mechanism.

Refactor set_rx_mode into 2 stages: a snapshot stage and the
actual I/O. In this new model, when _dev_set_rx_mode is called,
we take a snapshot of the current rx_config and then commit it
to the hardware later via a work item.

To accomplish this, reinterpret set_rx_mode as the ndo for
customizing the snapshot and enabling/disabling rx_mode set,
and add a new ndo, write_rx_mode, for the deferred I/O.

Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
---
 include/linux/netdevice.h | 104 ++++++++++++++++++-
 net/core/dev.c            | 208 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 305 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e808071dbb7d..e819426bb7cb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1049,6 +1049,40 @@ struct netdev_net_notifier {
 	struct notifier_block *nb;
 };
 
+enum netif_rx_mode_flags {
+	/* enable flags */
+	NETIF_RX_MODE_ALLMULTI_EN,
+	NETIF_RX_MODE_PROM_EN,
+	NETIF_RX_MODE_VLAN_EN,
+
+	/* control flags */
+	/* pending config state */
+	NETIF_RX_MODE_CFG_READY,
+
+	/* if set, rx_mode config work will not be executed */
+	NETIF_RX_MODE_SET_DIS,
+
+	/* if set, uc/mc lists will not be part of rx_mode config */
+	NETIF_RX_MODE_UC_SKIP,
+	NETIF_RX_MODE_MC_SKIP
+};
+
+struct netif_rx_mode_config {
+	char	*uc_addrs;
+	char	*mc_addrs;
+	int	uc_count;
+	int	mc_count;
+	int	ctrl_flags;
+	void	*priv_ptr;
+};
+
+struct netif_rx_mode_ctx {
+	struct work_struct		rx_mode_work;
+	struct net_device		*dev;
+	struct netif_rx_mode_config	*ready;
+	struct netif_rx_mode_config	*pending;
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1101,9 +1135,14 @@ struct netdev_net_notifier {
  *	changes to configuration when multicast or promiscuous is enabled.
  *
  * void (*ndo_set_rx_mode)(struct net_device *dev);
- *	This function is called device changes address list filtering.
+ *	This function is called when device changes address list filtering.
  *	If driver handles unicast address filtering, it should set
- *	IFF_UNICAST_FLT in its priv_flags.
+ *	IFF_UNICAST_FLT in its priv_flags. This is used to configure
+ *	the rx_mode snapshot that will be written to the hardware.
+ *
+ * void (*ndo_write_rx_mode)(struct net_device *dev);
+ *	This function is scheduled after set_rx_mode and is responsible for
+ *	writing the rx_mode snapshot to the hardware.
  *
  * int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
  *	This function  is called when the Media Access Control address
@@ -1424,6 +1463,7 @@ struct net_device_ops {
 	void			(*ndo_change_rx_flags)(struct net_device *dev,
 						       int flags);
 	void			(*ndo_set_rx_mode)(struct net_device *dev);
+	void			(*ndo_write_rx_mode)(struct net_device *dev);
 	int			(*ndo_set_mac_address)(struct net_device *dev,
 						       void *addr);
 	int			(*ndo_validate_addr)(struct net_device *dev);
@@ -1926,7 +1966,7 @@ enum netdev_reg_state {
  *	@ingress_queue:		XXX: need comments on this one
  *	@nf_hooks_ingress:	netfilter hooks executed for ingress packets
  *	@broadcast:		hw bcast address
- *
+ *	@rx_mode_ctx:		context required for rx_mode config work
  *	@rx_cpu_rmap:	CPU reverse-mapping for RX completion interrupts,
  *			indexed by RX queue number. Assigned by driver.
  *			This must only be set if the ndo_rx_flow_steer
@@ -2337,6 +2377,7 @@ struct net_device {
 #endif
 
 	unsigned char		broadcast[MAX_ADDR_LEN];
+	struct netif_rx_mode_ctx *rx_mode_ctx;
 #ifdef CONFIG_RFS_ACCEL
 	struct cpu_rmap		*rx_cpu_rmap;
 #endif
@@ -3360,6 +3401,63 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
 u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
 		     struct net_device *sb_dev);
 
+void netif_rx_mode_schedule_work(struct net_device *dev, bool flush);
+
+/* Drivers that implement rx mode as work flush the work item when closing
+ * or suspending. This is the substitute for those calls.
+ */
+static inline void netif_rx_mode_flush_work(struct net_device *dev)
+{
+	flush_work(&dev->rx_mode_ctx->rx_mode_work);
+}
+
+/* Helpers to be used in the set_rx_mode implementation */
+static inline void netif_rx_mode_set_bit(struct net_device *dev, int b,
+					 bool val)
+{
+	if (val)
+		dev->rx_mode_ctx->pending->ctrl_flags |= BIT(b);
+	else
+		dev->rx_mode_ctx->pending->ctrl_flags &= ~BIT(b);
+}
+
+static inline void netif_rx_mode_set_priv_ptr(struct net_device *dev,
+					      void *priv)
+{
+	dev->rx_mode_ctx->pending->priv_ptr = priv;
+}
+
+/* Helpers to be used in the write_rx_mode implementation */
+static inline bool netif_rx_mode_get_bit(struct net_device *dev, int b)
+{
+	return !!(dev->rx_mode_ctx->ready->ctrl_flags & BIT(b));
+}
+
+static inline void *netif_rx_mode_get_priv_ptr(struct net_device *dev)
+{
+	return dev->rx_mode_ctx->ready->priv_ptr;
+}
+
+static inline int netif_rx_mode_get_mc_count(struct net_device *dev)
+{
+	return dev->rx_mode_ctx->ready->mc_count;
+}
+
+static inline int netif_rx_mode_get_uc_count(struct net_device *dev)
+{
+	return dev->rx_mode_ctx->ready->uc_count;
+}
+
+#define netif_rx_mode_for_each_uc_addr(dev, ha_addr, idx) \
+	for (ha_addr = (dev)->rx_mode_ctx->ready->uc_addrs, idx = 0; \
+		idx < (dev)->rx_mode_ctx->ready->uc_count; \
+		ha_addr += (dev)->addr_len, idx++)
+
+#define netif_rx_mode_for_each_mc_addr(dev, ha_addr, idx) \
+	for (ha_addr = (dev)->rx_mode_ctx->ready->mc_addrs, idx = 0; \
+		idx < (dev)->rx_mode_ctx->ready->mc_count; \
+		ha_addr += (dev)->addr_len, idx++)
+
 int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev);
 int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 69515edd17bc..2be3ff8512b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1645,6 +1645,160 @@ static int napi_kthread_create(struct napi_struct *n)
 	return err;
 }
 
+/* The existence of pending/ready config is an implementation detail. The
+ * caller shouldn't be aware of them. This is a bit hacky. We read
+ * bits from pending because control bits need to be read before pending
+ * is prepared.
+ */
+static bool __netif_rx_mode_pending_get_bit(struct net_device *dev, int b)
+{
+	return !!(dev->rx_mode_ctx->pending->ctrl_flags & BIT(b));
+}
+
+/* This function attempts to copy the current state of the
+ * net device into pending (reallocating if necessary). If it fails,
+ * pending is guaranteed to be unmodified.
+ */
+static int netif_rx_mode_alloc_and_fill_pending(struct net_device *dev)
+{
+	struct netif_rx_mode_config *pending = dev->rx_mode_ctx->pending;
+	int uc_count = 0, mc_count = 0;
+	struct netdev_hw_addr *ha;
+	char *tmp;
+	int i;
+
+	/* The allocations need to be atomic since this will be called under
+	 * netif_addr_lock_bh()
+	 */
+	if (!__netif_rx_mode_pending_get_bit(dev, NETIF_RX_MODE_UC_SKIP)) {
+		uc_count = netdev_uc_count(dev);
+		tmp = krealloc(pending->uc_addrs,
+			       uc_count * dev->addr_len,
+			       GFP_ATOMIC);
+		if (!tmp)
+			return -ENOMEM;
+		pending->uc_addrs = tmp;
+	}
+
+	if (!__netif_rx_mode_pending_get_bit(dev, NETIF_RX_MODE_MC_SKIP)) {
+		mc_count = netdev_mc_count(dev);
+		tmp = krealloc(pending->mc_addrs,
+			       mc_count * dev->addr_len,
+			       GFP_ATOMIC);
+		if (!tmp)
+			return -ENOMEM;
+		pending->mc_addrs = tmp;
+	}
+
+	/* This function cannot fail after this point */
+
+	/* This is going to be the same for every single driver. Better to
+	 * do it here than in the set_rx_mode impl
+	 */
+	netif_rx_mode_set_bit(dev, NETIF_RX_MODE_ALLMULTI_EN,
+			      !!(dev->flags & IFF_ALLMULTI));
+
+	netif_rx_mode_set_bit(dev, NETIF_RX_MODE_PROM_EN,
+			      !!(dev->flags & IFF_PROMISC));
+
+	i = 0;
+	if (!__netif_rx_mode_pending_get_bit(dev, NETIF_RX_MODE_UC_SKIP)) {
+		pending->uc_count = uc_count;
+		netdev_for_each_uc_addr(ha, dev)
+			memcpy(pending->uc_addrs + (i++) * dev->addr_len,
+			       ha->addr,
+			       dev->addr_len);
+	}
+
+	i = 0;
+	if (!__netif_rx_mode_pending_get_bit(dev, NETIF_RX_MODE_MC_SKIP)) {
+		pending->mc_count = mc_count;
+		netdev_for_each_mc_addr(ha, dev)
+			memcpy(pending->mc_addrs + (i++) * dev->addr_len,
+			       ha->addr,
+			       dev->addr_len);
+	}
+	return 0;
+}
+
+static void netif_rx_mode_prepare_pending(struct net_device *dev)
+{
+	int rc;
+
+	lockdep_assert_held(&dev->addr_list_lock);
+	rc = netif_rx_mode_alloc_and_fill_pending(dev);
+	if (rc)
+		return;
+
+	netif_rx_mode_set_bit(dev, NETIF_RX_MODE_CFG_READY, true);
+}
+
+static void netif_rx_mode_write_active(struct work_struct *param)
+{
+	struct netif_rx_mode_ctx *rx_mode_ctx = container_of(param,
+			struct netif_rx_mode_ctx, rx_mode_work);
+
+	struct net_device *dev = rx_mode_ctx->dev;
+
+	/* Paranoia. */
+	WARN_ON(!dev->netdev_ops->ndo_write_rx_mode);
+
+	/* We could introduce a new lock for this but reusing the addr
+	 * lock works well enough
+	 */
+	netif_addr_lock_bh(dev);
+
+	/* There's no point continuing if the pending config is not ready */
+	if (!__netif_rx_mode_pending_get_bit(dev, NETIF_RX_MODE_CFG_READY)) {
+		netif_addr_unlock_bh(dev);
+		return;
+	}
+
+	/* We use the prepared pending config as the new ready config and
+	 * reuse old ready config's memory for the next pending config
+	 */
+	swap(rx_mode_ctx->ready, rx_mode_ctx->pending);
+	netif_rx_mode_set_bit(dev, NETIF_RX_MODE_CFG_READY, false);
+
+	netif_addr_unlock_bh(dev);
+
+	rtnl_lock();
+	dev->netdev_ops->ndo_write_rx_mode(dev);
+	rtnl_unlock();
+}
+
+static int alloc_rx_mode_ctx(struct net_device *dev)
+{
+	dev->rx_mode_ctx = kzalloc(sizeof(*dev->rx_mode_ctx), GFP_KERNEL);
+
+	if (!dev->rx_mode_ctx)
+		goto fail;
+
+	dev->rx_mode_ctx->ready = kzalloc(sizeof(*dev->rx_mode_ctx->ready),
+					  GFP_KERNEL);
+
+	if (!dev->rx_mode_ctx->ready)
+		goto fail_ready;
+
+	dev->rx_mode_ctx->pending = kzalloc(sizeof(*dev->rx_mode_ctx->pending),
+					    GFP_KERNEL);
+
+	if (!dev->rx_mode_ctx->pending)
+		goto fail_pending;
+
+	INIT_WORK(&dev->rx_mode_ctx->rx_mode_work, netif_rx_mode_write_active);
+	dev->rx_mode_ctx->dev = dev;
+
+	return 0;
+
+fail_pending:
+	kfree(dev->rx_mode_ctx->ready);
+fail_ready:
+	kfree(dev->rx_mode_ctx);
+fail:
+	return -ENOMEM;
+}
+
 static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -1679,6 +1833,9 @@ static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
 	if (ops->ndo_validate_addr)
 		ret = ops->ndo_validate_addr(dev);
 
+	if (!ret && ops->ndo_write_rx_mode)
+		ret = alloc_rx_mode_ctx(dev);
+
 	if (!ret && ops->ndo_open)
 		ret = ops->ndo_open(dev);
 
@@ -1713,6 +1870,22 @@ int netif_open(struct net_device *dev, struct netlink_ext_ack *extack)
 	return ret;
 }
 
+static void cleanup_rx_mode_ctx(struct net_device *dev)
+{
+	/* cancel and wait for execution to complete */
+	cancel_work_sync(&dev->rx_mode_ctx->rx_mode_work);
+
+	kfree(dev->rx_mode_ctx->pending->uc_addrs);
+	kfree(dev->rx_mode_ctx->pending->mc_addrs);
+	kfree(dev->rx_mode_ctx->pending);
+
+	kfree(dev->rx_mode_ctx->ready->uc_addrs);
+	kfree(dev->rx_mode_ctx->ready->mc_addrs);
+	kfree(dev->rx_mode_ctx->ready);
+
+	kfree(dev->rx_mode_ctx);
+}
+
 static void __dev_close_many(struct list_head *head)
 {
 	struct net_device *dev;
@@ -1755,6 +1928,9 @@ static void __dev_close_many(struct list_head *head)
 		if (ops->ndo_stop)
 			ops->ndo_stop(dev);
 
+		if (ops->ndo_write_rx_mode)
+			cleanup_rx_mode_ctx(dev);
+
 		netif_set_up(dev, false);
 		netpoll_poll_enable(dev);
 	}
@@ -9613,6 +9789,33 @@ int netif_set_allmulti(struct net_device *dev, int inc, bool notify)
 	return 0;
 }
 
+/* netif_rx_mode_schedule_work - Sets up the rx_config snapshot and
+ * schedules the deferred I/O. If it's necessary to wait for completion
+ * of I/O, set flush to true.
+ */
+void netif_rx_mode_schedule_work(struct net_device *dev, bool flush)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_set_rx_mode)
+		ops->ndo_set_rx_mode(dev);
+
+	/* Return early if ndo_write_rx_mode is not implemented */
+	if (!ops->ndo_write_rx_mode)
+		return;
+
+	/* If rx_mode config is disabled, we don't schedule the work */
+	if (__netif_rx_mode_pending_get_bit(dev, NETIF_RX_MODE_SET_DIS))
+		return;
+
+	netif_rx_mode_prepare_pending(dev);
+
+	schedule_work(&dev->rx_mode_ctx->rx_mode_work);
+	if (flush)
+		flush_work(&dev->rx_mode_ctx->rx_mode_work);
+}
+EXPORT_SYMBOL(netif_rx_mode_schedule_work);
+
 /*
  *	Upload unicast and multicast address lists to device and
  *	configure RX filtering. When the device doesn't support unicast
@@ -9621,8 +9824,6 @@ int netif_set_allmulti(struct net_device *dev, int inc, bool notify)
  */
 void __dev_set_rx_mode(struct net_device *dev)
 {
-	const struct net_device_ops *ops = dev->netdev_ops;
-
 	/* dev_open will call this function so the list will stay sane. */
 	if (!(dev->flags&IFF_UP))
 		return;
@@ -9643,8 +9844,7 @@ void __dev_set_rx_mode(struct net_device *dev)
 		}
 	}
 
-	if (ops->ndo_set_rx_mode)
-		ops->ndo_set_rx_mode(dev);
+	netif_rx_mode_schedule_work(dev, false);
 }
 
 void dev_set_rx_mode(struct net_device *dev)
-- 
2.34.1



* [PATCH net-next v5 2/2] virtio-net: Implement ndo_write_rx_mode callback
  2025-11-20 14:13 [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot I Viswanath
  2025-11-20 14:13 ` [PATCH net-next v5 1/2] net: refactor set_rx_mode into snapshot and deferred I/O I Viswanath
@ 2025-11-20 14:13 ` I Viswanath
  2025-11-20 15:17 ` [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot Jakub Kicinski
  2 siblings, 0 replies; 5+ messages in thread
From: I Viswanath @ 2025-11-20 14:13 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, horms, sdf, kuniyu,
	skhawaja, aleksander.lobakin, mst, jasowang, xuanzhuo, eperezma
  Cc: virtualization, netdev, linux-kernel, linux-kernel-mentees,
	I Viswanath

Implement the ndo_write_rx_mode callback for virtio-net.

Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com>
---
 drivers/net/virtio_net.c | 58 +++++++++++++++++-----------------------
 1 file changed, 25 insertions(+), 33 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index cfa006b88688..02bf9bc970a0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -460,9 +460,6 @@ struct virtnet_info {
 	/* Work struct for config space updates */
 	struct work_struct config_work;
 
-	/* Work struct for setting rx mode */
-	struct work_struct rx_mode_work;
-
 	/* OK to queue work setting RX mode? */
 	bool rx_mode_work_enabled;
 
@@ -3857,33 +3854,31 @@ static int virtnet_close(struct net_device *dev)
 	return 0;
 }
 
-static void virtnet_rx_mode_work(struct work_struct *work)
+static void virtnet_write_rx_mode(struct net_device *dev)
 {
-	struct virtnet_info *vi =
-		container_of(work, struct virtnet_info, rx_mode_work);
+	struct virtnet_info *vi = netif_rx_mode_get_priv_ptr(dev);
 	u8 *promisc_allmulti  __free(kfree) = NULL;
-	struct net_device *dev = vi->dev;
 	struct scatterlist sg[2];
 	struct virtio_net_ctrl_mac *mac_data;
-	struct netdev_hw_addr *ha;
+	char *ha_addr;
 	int uc_count;
 	int mc_count;
 	void *buf;
+	int idx;
 	int i;
 
 	/* We can't dynamically set ndo_set_rx_mode, so return gracefully */
 	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
 		return;
 
-	promisc_allmulti = kzalloc(sizeof(*promisc_allmulti), GFP_KERNEL);
+	promisc_allmulti = kzalloc(sizeof(*promisc_allmulti), GFP_ATOMIC);
 	if (!promisc_allmulti) {
 		dev_warn(&dev->dev, "Failed to set RX mode, no memory.\n");
 		return;
 	}
 
-	rtnl_lock();
-
-	*promisc_allmulti = !!(dev->flags & IFF_PROMISC);
+	*promisc_allmulti = netif_rx_mode_get_bit(dev,
+						  NETIF_RX_MODE_PROM_EN);
 	sg_init_one(sg, promisc_allmulti, sizeof(*promisc_allmulti));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
@@ -3891,7 +3886,8 @@ static void virtnet_rx_mode_work(struct work_struct *work)
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
 			 *promisc_allmulti ? "en" : "dis");
 
-	*promisc_allmulti = !!(dev->flags & IFF_ALLMULTI);
+	*promisc_allmulti = netif_rx_mode_get_bit(dev,
+						  NETIF_RX_MODE_ALLMULTI_EN);
 	sg_init_one(sg, promisc_allmulti, sizeof(*promisc_allmulti));
 
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
@@ -3899,27 +3895,24 @@ static void virtnet_rx_mode_work(struct work_struct *work)
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
 			 *promisc_allmulti ? "en" : "dis");
 
-	netif_addr_lock_bh(dev);
+	uc_count = netif_rx_mode_get_uc_count(dev);
+	mc_count = netif_rx_mode_get_mc_count(dev);
 
-	uc_count = netdev_uc_count(dev);
-	mc_count = netdev_mc_count(dev);
 	/* MAC filter - use one buffer for both lists */
 	buf = kzalloc(((uc_count + mc_count) * ETH_ALEN) +
 		      (2 * sizeof(mac_data->entries)), GFP_ATOMIC);
 	mac_data = buf;
-	if (!buf) {
-		netif_addr_unlock_bh(dev);
-		rtnl_unlock();
+	if (!buf)
 		return;
-	}
 
 	sg_init_table(sg, 2);
 
 	/* Store the unicast list and count in the front of the buffer */
 	mac_data->entries = cpu_to_virtio32(vi->vdev, uc_count);
 	i = 0;
-	netdev_for_each_uc_addr(ha, dev)
-		memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
+
+	netif_rx_mode_for_each_uc_addr(dev, ha_addr, idx)
+		memcpy(&mac_data->macs[i++][0], ha_addr, ETH_ALEN);
 
 	sg_set_buf(&sg[0], mac_data,
 		   sizeof(mac_data->entries) + (uc_count * ETH_ALEN));
@@ -3929,10 +3922,8 @@ static void virtnet_rx_mode_work(struct work_struct *work)
 
 	mac_data->entries = cpu_to_virtio32(vi->vdev, mc_count);
 	i = 0;
-	netdev_for_each_mc_addr(ha, dev)
-		memcpy(&mac_data->macs[i++][0], ha->addr, ETH_ALEN);
-
-	netif_addr_unlock_bh(dev);
+	netif_rx_mode_for_each_mc_addr(dev, ha_addr, idx)
+		memcpy(&mac_data->macs[i++][0], ha_addr, ETH_ALEN);
 
 	sg_set_buf(&sg[1], mac_data,
 		   sizeof(mac_data->entries) + (mc_count * ETH_ALEN));
@@ -3941,17 +3932,18 @@ static void virtnet_rx_mode_work(struct work_struct *work)
 				  VIRTIO_NET_CTRL_MAC_TABLE_SET, sg))
 		dev_warn(&dev->dev, "Failed to set MAC filter table.\n");
 
-	rtnl_unlock();
-
 	kfree(buf);
 }
 
 static void virtnet_set_rx_mode(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	bool cfg_disabled;
+
+	cfg_disabled = !vi->rx_mode_work_enabled;
+	netif_rx_mode_set_bit(dev, NETIF_RX_MODE_SET_DIS, cfg_disabled);
 
-	if (vi->rx_mode_work_enabled)
-		schedule_work(&vi->rx_mode_work);
+	netif_rx_mode_set_priv_ptr(dev, vi);
 }
 
 static int virtnet_vlan_rx_add_vid(struct net_device *dev,
@@ -5767,7 +5759,7 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
 	/* Make sure no work handler is accessing the device */
 	flush_work(&vi->config_work);
 	disable_rx_mode_work(vi);
-	flush_work(&vi->rx_mode_work);
+	netif_rx_mode_flush_work(vi->dev);
 
 	if (netif_running(vi->dev)) {
 		rtnl_lock();
@@ -6270,6 +6262,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_validate_addr   = eth_validate_addr,
 	.ndo_set_mac_address = virtnet_set_mac_address,
 	.ndo_set_rx_mode     = virtnet_set_rx_mode,
+	.ndo_write_rx_mode   = virtnet_write_rx_mode,
 	.ndo_get_stats64     = virtnet_stats,
 	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
@@ -6891,7 +6884,6 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vdev->priv = vi;
 
 	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
-	INIT_WORK(&vi->rx_mode_work, virtnet_rx_mode_work);
 	spin_lock_init(&vi->refill_lock);
 
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
@@ -7196,7 +7188,7 @@ static void virtnet_remove(struct virtio_device *vdev)
 	/* Make sure no work handler is accessing the device. */
 	flush_work(&vi->config_work);
 	disable_rx_mode_work(vi);
-	flush_work(&vi->rx_mode_work);
+	netif_rx_mode_flush_work(vi->dev);
 
 	virtnet_free_irq_moder(vi);
 
-- 
2.34.1



* Re: [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot
  2025-11-20 14:13 [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot I Viswanath
  2025-11-20 14:13 ` [PATCH net-next v5 1/2] net: refactor set_rx_mode into snapshot and deferred I/O I Viswanath
  2025-11-20 14:13 ` [PATCH net-next v5 2/2] virtio-net: Implement ndo_write_rx_mode callback I Viswanath
@ 2025-11-20 15:17 ` Jakub Kicinski
  2025-11-21 17:48   ` I Viswanath
  2 siblings, 1 reply; 5+ messages in thread
From: Jakub Kicinski @ 2025-11-20 15:17 UTC (permalink / raw)
  To: I Viswanath
  Cc: andrew+netdev, davem, edumazet, pabeni, horms, sdf, kuniyu,
	skhawaja, aleksander.lobakin, mst, jasowang, xuanzhuo, eperezma,
	virtualization, netdev, linux-kernel, linux-kernel-mentees

On Thu, 20 Nov 2025 19:43:52 +0530 I Viswanath wrote:
> Teardown path:
> 
> Relevant as they flush the work item. ens4 is the virtio-net interface.
> 
> virtnet_remove:
> ip maddr add 01:00:5e:00:03:02 dev ens4; echo 1 > /sys/bus/pci/devices/0000:00:04.0/remove
> 
> virtnet_freeze_down:
> ip maddr add 01:00:5e:00:03:02 dev ens4; echo mem > /sys/power/state

Running 

make -C tools/testing/selftests TARGETS="drivers/net/virtio_net" run_tests

[    1.967073] BUG: kernel NULL pointer dereference, address: 0000000000000018
[    1.967179] #PF: supervisor read access in kernel mode
[    1.967237] #PF: error_code(0x0000) - not-present page
[    1.967296] PGD 0 P4D 0 
[    1.967327] Oops: Oops: 0000 [#1] SMP
[    1.967372] CPU: 2 UID: 0 PID: 220 Comm: basic_features. Not tainted 6.18.0-rc5-virtme #1 PREEMPT(voluntary) 
[    1.967500] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    1.967576] RIP: 0010:__flush_work+0x33/0x3a0
[    1.967651] Code: 41 55 41 54 55 53 48 83 ec 60 44 0f b6 25 0d ab 91 01 65 48 8b 05 2d ff 8d 01 48 89 44 24 58 31 c0 45 84 e4 0f 84 35 03 00 00 <48> 83 7f 18 00 48 89 fd 0f 84 30 03 00 00 41 89 f5 e8 07 24 07 00
[    1.967861] RSP: 0018:ffffab9bc0597cf0 EFLAGS: 00010202
[    1.967920] RAX: 0000000000000000 RBX: ffff9d08c2c549c0 RCX: ffffab9bc0597d28
[    1.968010] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    1.968100] RBP: ffff9d08c1cd5000 R08: ffff9d08c1db7b70 R09: 0000000000000000
[    1.968189] R10: ffff9d08c1db7f80 R11: ffff9d08c152e480 R12: 0000000000000001
[    1.968281] R13: ffffffffbd9ffe00 R14: ffff9d08c193e140 R15: 0000000000000008
[    1.968371] FS:  00007fb66173b000(0000) GS:ffff9d0940ce9000(0000) knlGS:0000000000000000
[    1.968472] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.968546] CR2: 0000000000000018 CR3: 00000000045e5006 CR4: 0000000000772ef0
[    1.968640] PKRU: 55555554
[    1.968669] Call Trace:
[    1.968700]  <TASK>
[    1.968729]  ? kernfs_should_drain_open_files+0x2e/0x40
[    1.968796]  ? __rtnl_unlock+0x37/0x60
[    1.968849]  ? netdev_run_todo+0x63/0x550
[    1.968894]  ? kernfs_name_hash+0x12/0x80
[    1.968938]  virtnet_remove+0x65/0xb0
[    1.968984]  virtio_dev_remove+0x3c/0x80
[    1.969029]  device_release_driver_internal+0x193/0x200
[    1.969090]  unbind_store+0x9d/0xb0
[    1.969136]  kernfs_fop_write_iter+0x12b/0x1c0
[    1.969197]  vfs_write+0x33a/0x470
[    1.969242]  ksys_write+0x65/0xe0
[    1.969287]  do_syscall_64+0xa4/0xfd0
[    1.969333]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[    1.969393] RIP: 0033:0x7fb66183b257
[    1.969434] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[    1.969640] RSP: 002b:00007fffca552fe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[    1.969729] RAX: ffffffffffffffda RBX: 00007fb661937780 RCX: 00007fb66183b257
[    1.969820] RDX: 0000000000000008 RSI: 0000558903e5a280 RDI: 0000000000000001
[    1.969911] RBP: 0000000000000008 R08: 0000000000000003 R09: 0000000000000077
[    1.970004] R10: 0000000000000063 R11: 0000000000000246 R12: 0000000000000008
[    1.970096] R13: 0000558903e5a280 R14: 0000000000000008 R15: 00007fb6619329c0
[    1.970189]  </TASK>
[    1.970218] Modules linked in:
[    1.970266] CR2: 0000000000000018
[    1.970311] ---[ end trace 0000000000000000 ]---
[    1.970372] RIP: 0010:__flush_work+0x33/0x3a0
[    1.970441] Code: 41 55 41 54 55 53 48 83 ec 60 44 0f b6 25 0d ab 91 01 65 48 8b 05 2d ff 8d 01 48 89 44 24 58 31 c0 45 84 e4 0f 84 35 03 00 00 <48> 83 7f 18 00 48 89 fd 0f 84 30 03 00 00 41 89 f5 e8 07 24 07 00
[    1.970656] RSP: 0018:ffffab9bc0597cf0 EFLAGS: 00010202
[    1.970717] RAX: 0000000000000000 RBX: ffff9d08c2c549c0 RCX: ffffab9bc0597d28
[    1.970806] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    1.970897] RBP: ffff9d08c1cd5000 R08: ffff9d08c1db7b70 R09: 0000000000000000
[    1.970988] R10: ffff9d08c1db7f80 R11: ffff9d08c152e480 R12: 0000000000000001
[    1.971081] R13: ffffffffbd9ffe00 R14: ffff9d08c193e140 R15: 0000000000000008
[    1.971174] FS:  00007fb66173b000(0000) GS:ffff9d0940ce9000(0000) knlGS:0000000000000000
[    1.971264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.971337] CR2: 0000000000000018 CR3: 00000000045e5006 CR4: 0000000000772ef0
[    1.971431] PKRU: 55555554
[    1.971460] note: basic_features.[220] exited with irqs disabled
-- 
pw-bot: cr


* Re: [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot
  2025-11-20 15:17 ` [PATCH net-next v5 0/2] net: Split ndo_set_rx_mode into snapshot Jakub Kicinski
@ 2025-11-21 17:48   ` I Viswanath
  0 siblings, 0 replies; 5+ messages in thread
From: I Viswanath @ 2025-11-21 17:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: andrew+netdev, davem, edumazet, pabeni, horms, sdf, kuniyu,
	skhawaja, aleksander.lobakin, mst, jasowang, xuanzhuo, eperezma,
	virtualization, netdev, linux-kernel, linux-kernel-mentees

On Thu, 20 Nov 2025 at 20:47, Jakub Kicinski <kuba@kernel.org> wrote:

> Running
>
> make -C tools/testing/selftests TARGETS="drivers/net/virtio_net" run_tests

This bug seems to be caused by a call to probe() followed by remove()
without dev_open() ever being called, as dev->rx_mode_ctx is allocated
there. Modifying netif_rx_mode_flush_work() to call flush_work() only
when netif_running() is true seems to fix this specific bug.

However, I found the following deadlock while trying to reproduce that:

dev_close():
    rtnl_lock();
    cancel_work_sync(); // wait for netif_rx_mode_write_active to complete

netif_rx_mode_write_active(): // From work item

    rtnl_lock(); // Wait for the rtnl lock to be released

I can't find a good way to solve this without changing the alloc logic
to live partly in alloc_netdev_mqs, since we need the work struct to
stay alive after closing. Does this look good, if that's really the
most reasonable solution:

struct netif_rx_mode_ctx *rx_mode_ctx;

struct netif_rx_mode_ctx {
    struct work_struct rx_mode_work;
    struct netif_rx_mode_active_ctx *active_ctx;
    int state;
};

struct netif_rx_mode_active_ctx {
        struct net_device               *dev;
        struct netif_rx_mode_config     *ready;
        struct netif_rx_mode_config     *pending;
};

rx_mode_ctx will be handled in alloc_netdev_mqs()/free_netdev() while active_ctx
will be handled in dev_open()/dev_close()

Never call flush_work/cancel_work_sync for this work from core,
as that is a guaranteed deadlock because of how everything is
serialized.
