linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC/RFT PATCH net-next v3 0/2] net: Split ndo_set_rx_mode into snapshot and deferred write
@ 2025-10-28 17:42 I Viswanath
  2025-10-28 17:42 ` [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions: I Viswanath
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: I Viswanath @ 2025-10-28 17:42 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller
  Cc: netdev, linux-kernel, skhan, linux-kernel-mentees,
	david.hunter.linux, khalid, I Viswanath

This is an implementation of the idea provided by Jakub here

https://lore.kernel.org/netdev/20250923163727.5e97abdb@kernel.org/

ndo_set_rx_mode is problematic because it cannot sleep.

To address this, this series proposes dividing existing set_rx_mode
implementations into set_rx_mode and write_rx_config

The new set_rx_mode will be responsible for updating the rx_config
snapshot which will be used by ndo_write_rx_config to update the hardware

In brief, The callback implementations should look something like:

set_rx_mode():
    prepare_rx_config();
    update_snapshot();

write_rx_config():
    read_snapshot();
    do_io();

write_rx_config() is called from a work item making it sleepable
during the do_io() section.

This model should work correctly if the following conditions hold:

1. write_rx_config should use the rx_config set by the most recent
    call to set_rx_mode before its execution.

2. If a set_rx_mode call happens during execution of write_rx_config,
    write_rx_config should be rescheduled.

3. All calls to modify rx_mode should pass through the set_rx_mode +
    schedule write_rx_config execution flow.

1 and 2 are guaranteed because of the properties of work queues

Drivers need to ensure 3

ndo_write_rx_config has been implemented for 8139cp driver as proof of
concept

To use this model, a driver needs to implement the
ndo_write_rx_config callback, have a member rx_config in
the priv struct and replace all calls to set rx mode with
schedule_and_set_rx_mode();
---
v1:
Link: https://lore.kernel.org/netdev/20251020134857.5820-1-viswanathiyyappan@gmail.com/

v2:
- Exported set_and_schedule_rx_config as a symbol for use in modules
- Fixed incorrect cleanup for the case of rx_work alloc failing in alloc_netdev_mqs
- Removed the locked version (cp_set_rx_mode) and renamed __cp_set_rx_mode to cp_set_rx_mode
Link: https://lore.kernel.org/netdev/20251026175445.1519537-1-viswanathiyyappan@gmail.com/

v3:
- Added RFT tag
- Corrected mangled patch

I Viswanath (2):
  net: Add ndo_write_rx_config and helper structs and functions:
  net: ethernet: Implement ndo_write_rx_config callback for the 8139cp
    driver

 drivers/net/ethernet/realtek/8139cp.c | 78 ++++++++++++++++-----------
 include/linux/netdevice.h             | 38 ++++++++++++-
 net/core/dev.c                        | 53 ++++++++++++++++--
 3 files changed, 131 insertions(+), 38 deletions(-)
-- 
2.34.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions:
  2025-10-28 17:42 [RFC/RFT PATCH net-next v3 0/2] net: Split ndo_set_rx_mode into snapshot and deferred write I Viswanath
@ 2025-10-28 17:42 ` I Viswanath
  2025-10-31  2:20   ` Jakub Kicinski
  2025-10-28 17:42 ` [RFC/RFT PATCH net-next v3 2/2] net: ethernet: Implement ndo_write_rx_config callback for the 8139cp driver I Viswanath
  2025-10-29  4:48 ` [RFC/RFT PATCH net-next v3 0/2] net: Split ndo_set_rx_mode into snapshot and deferred write I Viswanath
  2 siblings, 1 reply; 10+ messages in thread
From: I Viswanath @ 2025-10-28 17:42 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller
  Cc: netdev, linux-kernel, skhan, linux-kernel-mentees,
	david.hunter.linux, khalid, I Viswanath

Add ndo_write_rx_config callback and helper structs/functions:

	rx_config_work - To schedule the callback and handle synchronization

	read_snapshot/update_snapshot - Helper functions to read/update the
		 rx_config snapshot

	set_and_schedule_rx_config - Helper function to call ndo_set_rx_mode
		 and schedule ndo_write_rx_config

	execute_write_rx_config - Helper function that will be scheduled
		 by rx_work->config_write

Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com>
---
 include/linux/netdevice.h | 38 +++++++++++++++++++++++++++-
 net/core/dev.c            | 53 +++++++++++++++++++++++++++++++++++----
 2 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7f5aad5cc9a1..9f188229443c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1100,7 +1100,12 @@ struct netdev_net_notifier {
  * void (*ndo_set_rx_mode)(struct net_device *dev);
  *	This function is called device changes address list filtering.
  *	If driver handles unicast address filtering, it should set
- *	IFF_UNICAST_FLT in its priv_flags.
+ *	IFF_UNICAST_FLT in its priv_flags. This sets up the snapshot of
+ *	rx_config that will be written to the device.
+ *
+ * void (*ndo_write_rx_config)(struct net_device *dev);
+ *	This function is scheduled immediately after ndo_set_rx_mode to
+ *	write rx_config to the device.
  *
  * int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
  *	This function  is called when the Media Access Control address
@@ -1421,6 +1426,7 @@ struct net_device_ops {
 	void			(*ndo_change_rx_flags)(struct net_device *dev,
 						       int flags);
 	void			(*ndo_set_rx_mode)(struct net_device *dev);
+	void			(*ndo_write_rx_config)(struct net_device *dev);
 	int			(*ndo_set_mac_address)(struct net_device *dev,
 						       void *addr);
 	int			(*ndo_validate_addr)(struct net_device *dev);
@@ -1767,6 +1773,12 @@ enum netdev_reg_state {
 	NETREG_DUMMY,		/* dummy device for NAPI poll */
 };
 
+struct rx_config_work {
+	struct work_struct config_write;
+	struct net_device *dev;
+	spinlock_t config_lock;
+};
+
 /**
  *	struct net_device - The DEVICE structure.
  *
@@ -2082,6 +2094,8 @@ enum netdev_reg_state {
  *			dev_list, one per address-family.
  *	@hwprov: Tracks which PTP performs hardware packet time stamping.
  *
+ *	@rx_work: helper struct to schedule rx config write to the hardware.
+ *
  *	FIXME: cleanup struct net_device such that network protocol info
  *	moves out.
  */
@@ -2559,6 +2573,8 @@ struct net_device {
 
 	struct hwtstamp_provider __rcu	*hwprov;
 
+	struct rx_config_work *rx_work;
+
 	u8			priv[] ____cacheline_aligned
 				       __counted_by(priv_len);
 } ____cacheline_aligned;
@@ -2734,6 +2750,26 @@ void dev_net_set(struct net_device *dev, struct net *net)
 	write_pnet(&dev->nd_net, net);
 }
 
+#define update_snapshot(config_ptr, type)						\
+	do {										\
+		typeof((config_ptr)) rx_config = ((type *)(dev->priv))->rx_config;	\
+		unsigned long flags;							\
+		spin_lock_irqsave(&((dev)->rx_work->config_lock), flags);		\
+		*rx_config = *(config_ptr);						\
+		spin_unlock_irqrestore(&((dev)->rx_work->config_lock), flags);		\
+	} while (0)
+
+#define read_snapshot(config_ptr, type)						\
+	do {										\
+		typeof((config_ptr)) rx_config = ((type *)(dev->priv))->rx_config;	\
+		unsigned long flags;							\
+		spin_lock_irqsave(&((dev)->rx_work->config_lock), flags);		\
+		*(config_ptr) = *rx_config;						\
+		spin_unlock_irqrestore(&((dev)->rx_work->config_lock), flags);		\
+	} while (0)
+
+void set_and_schedule_rx_config(struct net_device *dev, bool flush);
+
 /**
  *	netdev_priv - access network device private data
  *	@dev: network device
diff --git a/net/core/dev.c b/net/core/dev.c
index 378c2d010faf..69b8bfc0edeb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9571,6 +9571,37 @@ int netif_set_allmulti(struct net_device *dev, int inc, bool notify)
 	return 0;
 }
 
+static void execute_write_rx_config(struct work_struct *param)
+{
+	struct rx_config_work *rx_work = container_of(param,
+					 struct rx_config_work,
+					 config_write);
+	struct net_device *dev = rx_work->dev;
+
+	// This path should not be hit outside the work item
+	WARN_ON(!dev->netdev_ops->ndo_write_rx_config);
+	dev->netdev_ops->ndo_write_rx_config(dev);
+}
+
+/*	Sets up the rx_config snapshot and schedules write_rx_config. If
+ *	it's necessary to wait for completion of write_rx_config, set
+ *	flush to true.
+ */
+void set_and_schedule_rx_config(struct net_device *dev, bool flush)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_set_rx_mode)
+		ops->ndo_set_rx_mode(dev);
+
+	if (ops->ndo_write_rx_config) {
+		schedule_work(&dev->rx_work->config_write);
+		if (flush)
+			flush_work(&dev->rx_work->config_write);
+	}
+}
+EXPORT_SYMBOL(set_and_schedule_rx_config);
+
 /*
  *	Upload unicast and multicast address lists to device and
  *	configure RX filtering. When the device doesn't support unicast
@@ -9579,8 +9610,6 @@ int netif_set_allmulti(struct net_device *dev, int inc, bool notify)
  */
 void __dev_set_rx_mode(struct net_device *dev)
 {
-	const struct net_device_ops *ops = dev->netdev_ops;
-
 	/* dev_open will call this function so the list will stay sane. */
 	if (!(dev->flags&IFF_UP))
 		return;
@@ -9601,8 +9630,7 @@ void __dev_set_rx_mode(struct net_device *dev)
 		}
 	}
 
-	if (ops->ndo_set_rx_mode)
-		ops->ndo_set_rx_mode(dev);
+	set_and_schedule_rx_config(dev, false);
 }
 
 void dev_set_rx_mode(struct net_device *dev)
@@ -11961,9 +11989,17 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	refcount_set(&dev->dev_refcnt, 1);
 #endif
 
-	if (dev_addr_init(dev))
+	dev->rx_work = kmalloc(sizeof(*dev->rx_work), GFP_KERNEL);
+	if (!dev->rx_work)
 		goto free_pcpu;
 
+	dev->rx_work->dev = dev;
+	spin_lock_init(&dev->rx_work->config_lock);
+	INIT_WORK(&dev->rx_work->config_write, execute_write_rx_config);
+
+	if (dev_addr_init(dev))
+		goto free_rx_work;
+
 	dev_mc_init(dev);
 	dev_uc_init(dev);
 
@@ -12045,6 +12081,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	free_netdev(dev);
 	return NULL;
 
+free_rx_work:
+	cancel_work_sync(&dev->rx_work->config_write);
+	kfree(dev->rx_work);
+
 free_pcpu:
 #ifdef CONFIG_PCPU_DEV_REFCNT
 	free_percpu(dev->pcpu_refcnt);
@@ -12130,6 +12170,9 @@ void free_netdev(struct net_device *dev)
 		return;
 	}
 
+	cancel_work_sync(&dev->rx_work->config_write);
+	kfree(dev->rx_work);
+
 	BUG_ON(dev->reg_state != NETREG_UNREGISTERED);
 	WRITE_ONCE(dev->reg_state, NETREG_RELEASED);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC/RFT PATCH net-next v3 2/2] net: ethernet: Implement ndo_write_rx_config callback for the 8139cp driver
  2025-10-28 17:42 [RFC/RFT PATCH net-next v3 0/2] net: Split ndo_set_rx_mode into snapshot and deferred write I Viswanath
  2025-10-28 17:42 ` [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions: I Viswanath
@ 2025-10-28 17:42 ` I Viswanath
  2025-10-29  4:48 ` [RFC/RFT PATCH net-next v3 0/2] net: Split ndo_set_rx_mode into snapshot and deferred write I Viswanath
  2 siblings, 0 replies; 10+ messages in thread
From: I Viswanath @ 2025-10-28 17:42 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller
  Cc: netdev, linux-kernel, skhan, linux-kernel-mentees,
	david.hunter.linux, khalid, I Viswanath

Implement ndo_write_rx_config for the 8139cp driver

Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com>
---
 drivers/net/ethernet/realtek/8139cp.c | 78 ++++++++++++++++-----------
 1 file changed, 46 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index 5652da8a178c..ae3dca4f9b73 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -319,6 +319,11 @@ struct cp_extra_stats {
 	unsigned long		rx_frags;
 };
 
+struct cp_rx_config {
+	int rx_mode;
+	u32 mc_filter[2];	/* Multicast hash filter */
+};
+
 struct cp_private {
 	void			__iomem *regs;
 	struct net_device	*dev;
@@ -328,7 +333,7 @@ struct cp_private {
 	struct napi_struct	napi;
 
 	struct pci_dev		*pdev;
-	u32			rx_config;
+	struct cp_rx_config	*rx_config;
 	u16			cpcmd;
 
 	struct cp_extra_stats	cp_stats;
@@ -372,7 +377,6 @@ struct cp_private {
 	} while (0)
 
 
-static void __cp_set_rx_mode (struct net_device *dev);
 static void cp_tx (struct cp_private *cp);
 static void cp_clean_rings (struct cp_private *cp);
 #ifdef CONFIG_NET_POLL_CONTROLLER
@@ -882,55 +886,53 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb,
 	goto out_unlock;
 }
 
+static void cp_write_rx_config(struct net_device *dev)
+{
+	struct cp_private *cp = netdev_priv(dev);
+	struct cp_rx_config snapshot;
+
+	read_snapshot(&snapshot, struct cp_private);
+
+	/* We can safely update without stopping the chip. */
+	cpw32_f(RxConfig, snapshot.rx_mode);
+
+	cpw32_f(MAR0 + 0, snapshot.mc_filter[0]);
+	cpw32_f(MAR0 + 4, snapshot.mc_filter[1]);
+}
+
 /* Set or clear the multicast filter for this adaptor.
    This routine is not state sensitive and need not be SMP locked. */
 
-static void __cp_set_rx_mode (struct net_device *dev)
+static void cp_set_rx_mode (struct net_device *dev)
 {
-	struct cp_private *cp = netdev_priv(dev);
-	u32 mc_filter[2];	/* Multicast hash filter */
-	int rx_mode;
+	struct cp_rx_config new_config;
 
 	/* Note: do not reorder, GCC is clever about common statements. */
 	if (dev->flags & IFF_PROMISC) {
 		/* Unconditionally log net taps. */
-		rx_mode =
+		new_config.rx_mode =
 		    AcceptBroadcast | AcceptMulticast | AcceptMyPhys |
 		    AcceptAllPhys;
-		mc_filter[1] = mc_filter[0] = 0xffffffff;
+		new_config.mc_filter[1] = new_config.mc_filter[0] = 0xffffffff;
 	} else if ((netdev_mc_count(dev) > multicast_filter_limit) ||
 		   (dev->flags & IFF_ALLMULTI)) {
 		/* Too many to filter perfectly -- accept all multicasts. */
-		rx_mode = AcceptBroadcast | AcceptMulticast | AcceptMyPhys;
-		mc_filter[1] = mc_filter[0] = 0xffffffff;
+		new_config.rx_mode = AcceptBroadcast | AcceptMulticast | AcceptMyPhys;
+		new_config.mc_filter[1] = new_config.mc_filter[0] = 0xffffffff;
 	} else {
 		struct netdev_hw_addr *ha;
-		rx_mode = AcceptBroadcast | AcceptMyPhys;
-		mc_filter[1] = mc_filter[0] = 0;
+		new_config.rx_mode = AcceptBroadcast | AcceptMyPhys;
+		new_config.mc_filter[1] = new_config.mc_filter[0] = 0;
 		netdev_for_each_mc_addr(ha, dev) {
 			int bit_nr = ether_crc(ETH_ALEN, ha->addr) >> 26;
 
-			mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31);
-			rx_mode |= AcceptMulticast;
+			new_config.mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31);
+			new_config.rx_mode |= AcceptMulticast;
 		}
 	}
 
-	/* We can safely update without stopping the chip. */
-	cp->rx_config = cp_rx_config | rx_mode;
-	cpw32_f(RxConfig, cp->rx_config);
-
-	cpw32_f (MAR0 + 0, mc_filter[0]);
-	cpw32_f (MAR0 + 4, mc_filter[1]);
-}
-
-static void cp_set_rx_mode (struct net_device *dev)
-{
-	unsigned long flags;
-	struct cp_private *cp = netdev_priv(dev);
-
-	spin_lock_irqsave (&cp->lock, flags);
-	__cp_set_rx_mode(dev);
-	spin_unlock_irqrestore (&cp->lock, flags);
+	new_config.rx_mode = cp_rx_config | new_config.rx_mode;
+	update_snapshot(&new_config, struct cp_private);
 }
 
 static void __cp_get_stats(struct cp_private *cp)
@@ -1040,7 +1042,7 @@ static void cp_init_hw (struct cp_private *cp)
 	cp_start_hw(cp);
 	cpw8(TxThresh, 0x06); /* XXX convert magic num to a constant */
 
-	__cp_set_rx_mode(dev);
+	set_and_schedule_rx_config(dev, true);
 	cpw32_f (TxConfig, IFG | (TX_DMA_BURST << TxDMAShift));
 
 	cpw8(Config1, cpr8(Config1) | DriverLoaded | PMEnable);
@@ -1188,6 +1190,12 @@ static int cp_open (struct net_device *dev)
 	if (rc)
 		return rc;
 
+	cp->rx_config = kmalloc(sizeof(*cp->rx_config), GFP_KERNEL);
+	if (!cp->rx_config) {
+		rc = -ENOMEM;
+		goto err_out_rx_config;
+	}
+
 	napi_enable(&cp->napi);
 
 	cp_init_hw(cp);
@@ -1207,6 +1215,9 @@ static int cp_open (struct net_device *dev)
 err_out_hw:
 	napi_disable(&cp->napi);
 	cp_stop_hw(cp);
+	kfree(cp->rx_config);
+
+err_out_rx_config:
 	cp_free_rings(cp);
 	return rc;
 }
@@ -1227,6 +1238,8 @@ static int cp_close (struct net_device *dev)
 
 	cp_stop_hw(cp);
 
+	kfree(cp->rx_config);
+
 	spin_unlock_irqrestore(&cp->lock, flags);
 
 	free_irq(cp->pdev->irq, dev);
@@ -1262,7 +1275,7 @@ static void cp_tx_timeout(struct net_device *dev, unsigned int txqueue)
 	cp_clean_rings(cp);
 	cp_init_rings(cp);
 	cp_start_hw(cp);
-	__cp_set_rx_mode(dev);
+	set_and_schedule_rx_config(dev, false);
 	cpw16_f(IntrMask, cp_norx_intr_mask);
 
 	netif_wake_queue(dev);
@@ -1870,6 +1883,7 @@ static const struct net_device_ops cp_netdev_ops = {
 	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_set_mac_address 	= cp_set_mac_address,
 	.ndo_set_rx_mode	= cp_set_rx_mode,
+	.ndo_write_rx_config    = cp_write_rx_config,
 	.ndo_get_stats		= cp_get_stats,
 	.ndo_eth_ioctl		= cp_ioctl,
 	.ndo_start_xmit		= cp_start_xmit,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC/RFT PATCH net-next v3 0/2] net: Split ndo_set_rx_mode into snapshot and deferred write
  2025-10-28 17:42 [RFC/RFT PATCH net-next v3 0/2] net: Split ndo_set_rx_mode into snapshot and deferred write I Viswanath
  2025-10-28 17:42 ` [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions: I Viswanath
  2025-10-28 17:42 ` [RFC/RFT PATCH net-next v3 2/2] net: ethernet: Implement ndo_write_rx_config callback for the 8139cp driver I Viswanath
@ 2025-10-29  4:48 ` I Viswanath
  2 siblings, 0 replies; 10+ messages in thread
From: I Viswanath @ 2025-10-29  4:48 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller, andrew+netdev
  Cc: netdev, linux-kernel, skhan, linux-kernel-mentees,
	david.hunter.linux, khalid

Hi Andrew,
     I forgot to cc you on the original patch. Here's the link:
https://lore.kernel.org/netdev/20251028174222.1739954-1-viswanathiyyappan@gmail.com/

Thanks,
Viswanath

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions:
  2025-10-28 17:42 ` [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions: I Viswanath
@ 2025-10-31  2:20   ` Jakub Kicinski
  2025-11-04 16:43     ` I Viswanath
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2025-10-31  2:20 UTC (permalink / raw)
  To: I Viswanath
  Cc: davem, edumazet, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller, netdev, linux-kernel, skhan,
	linux-kernel-mentees, david.hunter.linux, khalid

On Tue, 28 Oct 2025 23:12:21 +0530 I Viswanath wrote:
> @@ -1421,6 +1426,7 @@ struct net_device_ops {
>  	void			(*ndo_change_rx_flags)(struct net_device *dev,
>  						       int flags);
>  	void			(*ndo_set_rx_mode)(struct net_device *dev);
> +	void			(*ndo_write_rx_config)(struct net_device *dev);
>  	int			(*ndo_set_mac_address)(struct net_device *dev,
>  						       void *addr);
>  	int			(*ndo_validate_addr)(struct net_device *dev);
> @@ -1767,6 +1773,12 @@ enum netdev_reg_state {
>  	NETREG_DUMMY,		/* dummy device for NAPI poll */
>  };
>  
> +struct rx_config_work {

pls make sure to prefix names of types and functions with netif,
netdev or net

> +	struct work_struct config_write;
> +	struct net_device *dev;
> +	spinlock_t config_lock;
> +};
> +

> +#define update_snapshot(config_ptr, type)						\
> +	do {										\
> +		typeof((config_ptr)) rx_config = ((type *)(dev->priv))->rx_config;	\
> +		unsigned long flags;							\
> +		spin_lock_irqsave(&((dev)->rx_work->config_lock), flags);		\
> +		*rx_config = *(config_ptr);						\
> +		spin_unlock_irqrestore(&((dev)->rx_work->config_lock), flags);		\
> +	} while (0)

The driver you picked is relatively trivial, advanced drivers need
to sync longer lists of mcast / ucast addresses. Bulk of the complexity
is in keeping those lists. Simple 

	*rx_config = *(config_ptr);

assignment is not enough. The driver needs to know old and new entries
and send ADD/DEL commands to FW. Converting virtio_net would be better,
but it does one huge dump which is also not representative of most
advanced NICs.

> @@ -11961,9 +11989,17 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  	refcount_set(&dev->dev_refcnt, 1);
>  #endif
>  
> -	if (dev_addr_init(dev))
> +	dev->rx_work = kmalloc(sizeof(*dev->rx_work), GFP_KERNEL);

Let's only allocate any extra state if driver has the NDO

> +	if (!dev->rx_work)
>  		goto free_pcpu;
>  
> +	dev->rx_work->dev = dev;
> +	spin_lock_init(&dev->rx_work->config_lock);
> +	INIT_WORK(&dev->rx_work->config_write, execute_write_rx_config);
> +
> +	if (dev_addr_init(dev))
> +		goto free_rx_work;
> +
>  	dev_mc_init(dev);
>  	dev_uc_init(dev);
>  
> @@ -12045,6 +12081,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  	free_netdev(dev);
>  	return NULL;
>  
> +free_rx_work:
> +	cancel_work_sync(&dev->rx_work->config_write);
> +	kfree(dev->rx_work);
> +
>  free_pcpu:
>  #ifdef CONFIG_PCPU_DEV_REFCNT
>  	free_percpu(dev->pcpu_refcnt);
> @@ -12130,6 +12170,9 @@ void free_netdev(struct net_device *dev)
>  		return;
>  	}
>  
> +	cancel_work_sync(&dev->rx_work->config_write);
> +	kfree(dev->rx_work);

We need to shut down sooner, some time between ndo_stop and ndo_uninit


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions:
  2025-10-31  2:20   ` Jakub Kicinski
@ 2025-11-04 16:43     ` I Viswanath
  2025-11-05  0:46       ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: I Viswanath @ 2025-11-04 16:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller, netdev, linux-kernel, skhan,
	linux-kernel-mentees, david.hunter.linux, khalid

On Fri, 31 Oct 2025 at 07:50, Jakub Kicinski <kuba@kernel.org> wrote:

> pls make sure to prefix names of types and functions with netif,
> netdev or net

I think netif is the prefix that makes the most sense here. I will do that

> The driver you picked is relatively trivial, advanced drivers need
> to sync longer lists of mcast / ucast addresses. Bulk of the complexity
> is in keeping those lists. Simple
>
>         *rx_config = *(config_ptr);
>
> assignment is not enough.

Apologies, I had the wrong mental model of the snapshot.

From what I understand, the snapshot should look something like

struct netif_rx_config {
    char *uc_addrs; // of size uc_count * dev->addr_len
    char *mc_addrs; // of size mc_count * dev->addr_len
    int uc_count;
    int mc_count;
    bool multi_en, promisc_en, vlan_en;
    void *device_specific_config;
}
Correct me if I have missed anything

Does the following pseudocode/skeleton make sense?

update_config() will be called at end of set_rx_mode()

read_config() is execute_write_rx_config() and do_io() is
dev->netdev_ops->ndo_write_rx_config() named that way
for consistency (since read/update)

atomic_t cfg_in_use = ATOMIC_INIT(false);
atomic_t cfg_update_pending = ATOMIC_INIT(false);

struct netif_rx_config *active, *staged;

void update_config()
{
    int was_config_pending = atomic_xchg(&cfg_update_pending, false);

    // If prepare_config fails, it leaves staged untouched
    // So, we check for and apply if pending update
    int rc = prepare_config(&staged);
    if (rc && !was_config_pending)
        return;

    if (atomic_read(&cfg_in_use)) {
        atomic_set(&cfg_update_pending, true);
        return;
    }
    swap(active, staged);
}

void read_config()
{
    atomic_set(&cfg_in_use, true);
    do_io(active);
    atomic_set(&cfg_in_use, false);

    // To account for the edge case where update_config() is called
    // during the execution of read_config() and there are no subsequent
    // calls to update_config()
    if (atomic_xchg(&cfg_update_pending, false))
        swap(active, staged);
}

>The driver needs to know old and new entries
> and send ADD/DEL commands to FW. Converting virtio_net would be better,
> but it does one huge dump which is also not representative of most
> advanced NICs.

We can definitely do this in prepare_config()
Speaking of which, How big can uc_count and mc_count be?

Would krealloc(buffer, uc_count * dev->addr_len, GFP_ATOMIC) be a good idea?

Well, virtio-net does kmalloc((uc_count + mc_count) * ETH_ALEN) + ...,
GFP_ATOMIC),
so this shouldn't introduce any new failures for virtio-net

> Let's only allocate any extra state if driver has the NDO
> We need to shut down sooner, some time between ndo_stop and ndo_uninit

Would it make sense to move init (if ndo exists) and cleanup to
__dev_open and __dev_close?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions:
  2025-11-04 16:43     ` I Viswanath
@ 2025-11-05  0:46       ` Jakub Kicinski
  2025-11-06 16:08         ` I Viswanath
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2025-11-05  0:46 UTC (permalink / raw)
  To: I Viswanath
  Cc: davem, edumazet, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller, netdev, linux-kernel, skhan,
	linux-kernel-mentees, david.hunter.linux, khalid

On Tue, 4 Nov 2025 22:13:49 +0530 I Viswanath wrote:
> On Fri, 31 Oct 2025 at 07:50, Jakub Kicinski <kuba@kernel.org> wrote:
> > The driver you picked is relatively trivial, advanced drivers need
> > to sync longer lists of mcast / ucast addresses. Bulk of the complexity
> > is in keeping those lists. Simple
> >
> >         *rx_config = *(config_ptr);
> >
> > assignment is not enough.  
> 
> Apologies, I had the wrong mental model of the snapshot.
> 
> From what I understand, the snapshot should look something like
> 
> struct netif_rx_config {
>     char *uc_addrs; // of size uc_count * dev->addr_len
>     char *mc_addrs; // of size mc_count * dev->addr_len
>     int uc_count;
>     int mc_count;
>     bool multi_en, promisc_en, vlan_en;
>     void *device_specific_config;
> }
> Correct me if I have missed anything
> 
> Does the following pseudocode/skeleton make sense?
> 
> update_config() will be called at end of set_rx_mode()
> 
> read_config() is execute_write_rx_config() and do_io() is
> dev->netdev_ops->ndo_write_rx_config() named that way
> for consistency (since read/update)
> 
> atomic_t cfg_in_use = ATOMIC_INIT(false);
> atomic_t cfg_update_pending = ATOMIC_INIT(false);
> 
> struct netif_rx_config *active, *staged;
> 
> void update_config()
> {
>     int was_config_pending = atomic_xchg(&cfg_update_pending, false);
> 
>     // If prepare_config fails, it leaves staged untouched
>     // So, we check for and apply if pending update
>     int rc = prepare_config(&staged);
>     if (rc && !was_config_pending)
>         return;
> 
>     if (atomic_read(&cfg_in_use)) {
>         atomic_set(&cfg_update_pending, true);
>         return;
>     }
>     swap(active, staged);
> }
> 
> void read_config()
> {
>     atomic_set(&cfg_in_use, true);
>     do_io(active);
>     atomic_set(&cfg_in_use, false);
> 
>     // To account for the edge case where update_config() is called
>     // during the execution of read_config() and there are no subsequent
>     // calls to update_config()
>     if (atomic_xchg(&cfg_update_pending, false))
>         swap(active, staged);
> }

I wouldn't use atomic flags. IIRC ndo_set_rx_mode is called under
netif_addr_lock_bh(), so we can reuse that lock, have update_config()
assume ownership of the pending config and update it directly.
And read_config() (which IIUC runs from a wq) can take that lock
briefly, and swap which config is pending.

> >The driver needs to know old and new entries
> > and send ADD/DEL commands to FW. Converting virtio_net would be better,
> > but it does one huge dump which is also not representative of most
> > advanced NICs.  
> 
> We can definitely do this in prepare_config()
> Speaking of which, How big can uc_count and mc_count be?
> 
> Would krealloc(buffer, uc_count * dev->addr_len, GFP_ATOMIC) be a good idea?

Not sure about the max value but I'd think low thousands is probably 
a good target. IOW yes, I think one linear buffer may be a concern.
I'd think order 1 allocation may be fine tho..

> Well, virtio-net does kmalloc((uc_count + mc_count) * ETH_ALEN) + ...,
> GFP_ATOMIC),
> so this shouldn't introduce any new failures for virtio-net

Right but IDK if virtio is used on systems with the same sort of scale
as a large physical function driver..

> > Let's only allocate any extra state if driver has the NDO
> > We need to shut down sooner, some time between ndo_stop and ndo_uninit  
> 
> Would it make sense to move init (if ndo exists) and cleanup to
> __dev_open and __dev_close?

Yes, indeed.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions:
  2025-11-05  0:46       ` Jakub Kicinski
@ 2025-11-06 16:08         ` I Viswanath
  2025-11-06 22:12           ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: I Viswanath @ 2025-11-06 16:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller, netdev, linux-kernel, skhan,
	linux-kernel-mentees, david.hunter.linux, khalid

On Wed, 5 Nov 2025 at 06:16, Jakub Kicinski <kuba@kernel.org> wrote:
> I wouldn't use atomic flags. IIRC ndo_set_rx_mode is called under
> netif_addr_lock_bh(), so we can reuse that lock, have update_config()
> assume ownership of the pending config and update it directly.
> And read_config() (which IIUC runs from a wq) can take that lock
> briefly, and swap which config is pending.

How does this look?

It's possible for the actual work of set_rx_mode to be in a work item
so we need to validate that dev->addr_list_lock is held in update_config()

// These variables will be part of dev->netif_rx_config_ctx in the final code
bool pending_cfg_ready = false;
struct netif_rx_config *ready, *pending;

void update_config()
{
    WARN_ONCE(!spin_is_locked(&dev->addr_list_lock),
    "netif_update_rx_config() called without netif_addr_lock_bh()\n");

    int rc = netif_prepare_rx_config(&pending);
    if (rc)
        return;

    pending_cfg_ready = true;
}

void read_config()
{
    // We could introduce a new lock for this but
    // reusing the addr lock works well enough
    netif_addr_lock_bh();

    // There's no point continuing if the pending config
    // is not ready
    if(!pending_cfg_ready) {
       netif_addr_unlock_bh();
       return;
    }

    swap(ready, pending);
    pending_cfg_ready = false;

    netif_addr_unlock_bh();

    do_io(ready);
}

On the topic of virtio_net:

set_rx_mode in virtio_net schedules and does the actual work in a work
item, so would
the correct justification here be moving I/O out of the rtnl lock?

If this looks good, I will start working on v4

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions:
  2025-11-06 22:12           ` Jakub Kicinski
@ 2025-11-06 20:03             ` I Viswanath
  0 siblings, 0 replies; 10+ messages in thread
From: I Viswanath @ 2025-11-06 20:03 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, edumazet, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller, netdev, linux-kernel, skhan,
	linux-kernel-mentees, david.hunter.linux, khalid

On Fri, 7 Nov 2025 at 03:43, Jakub Kicinski <kuba@kernel.org> wrote:
> The objective is just to simplify the driver. Avoid the state
> management and work scheduling in every driver that needs to the config
> async. IOW we just want to give drivers an ndo_set_rx_mode that can
> sleep.

Fair enough. I will start working on v4 with virtio_net

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions:
  2025-11-06 16:08         ` I Viswanath
@ 2025-11-06 22:12           ` Jakub Kicinski
  2025-11-06 20:03             ` I Viswanath
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Kicinski @ 2025-11-06 22:12 UTC (permalink / raw)
  To: I Viswanath
  Cc: davem, edumazet, pabeni, horms, sdf, kuniyu, ahmed.zaki,
	aleksander.lobakin, jacob.e.keller, netdev, linux-kernel, skhan,
	linux-kernel-mentees, david.hunter.linux, khalid

On Thu, 6 Nov 2025 21:38:59 +0530 I Viswanath wrote:
> On Wed, 5 Nov 2025 at 06:16, Jakub Kicinski <kuba@kernel.org> wrote:
> > I wouldn't use atomic flags. IIRC ndo_set_rx_mode is called under
> > netif_addr_lock_bh(), so we can reuse that lock, have update_config()
> > assume ownership of the pending config and update it directly.
> > And read_config() (which IIUC runs from a wq) can take that lock
> > briefly, and swap which config is pending.  
> 
> How does this look?
> 
> It's possible for the actual work of set_rx_mode to be in a work item
> so we need to validate that dev->addr_list_lock is held in update_config()
> 
> // These variables will be part of dev->netif_rx_config_ctx in the final code
> bool pending_cfg_ready = false;
> struct netif_rx_config *ready, *pending;
> 
> void update_config()
> {
>     WARN_ONCE(!spin_is_locked(&dev->addr_list_lock),
>     "netif_update_rx_config() called without netif_addr_lock_bh()\n");
> 
>     int rc = netif_prepare_rx_config(&pending);
>     if (rc)
>         return;
> 
>     pending_cfg_ready = true;
> }
> 
> void read_config()
> {
>     // We could introduce a new lock for this but
>     // reusing the addr lock works well enough
>     netif_addr_lock_bh();
> 
>     // There's no point continuing if the pending config
>     // is not ready
>     if(!pending_cfg_ready) {
>        netif_addr_unlock_bh();
>        return;
>     }
> 
>     swap(ready, pending);
>     pending_cfg_ready = false;
> 
>     netif_addr_unlock_bh();
> 
>     do_io(ready);
> }

Yes, I think this flow looks good.

> On the topic of virtio_net:
> 
> set_rx_mode in virtio_net schedules and does the actual work in a work
> item, so would
> the correct justification here be moving I/O out of the rtnl lock?

Avoiding rtnl_lock is not a goal right now. We should still take
rtnl_lock around read_config() in case the driver assumes rtnl_lock
is held around all its config paths.

The objective is just to simplify the driver. Avoid the state
management and work scheduling in every driver that needs to the config
async. IOW we just want to give drivers an ndo_set_rx_mode that can
sleep.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-11-07  1:33 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-28 17:42 [RFC/RFT PATCH net-next v3 0/2] net: Split ndo_set_rx_mode into snapshot and deferred write I Viswanath
2025-10-28 17:42 ` [RFC/RFT PATCH net-next v3 1/2] net: Add ndo_write_rx_config and helper structs and functions: I Viswanath
2025-10-31  2:20   ` Jakub Kicinski
2025-11-04 16:43     ` I Viswanath
2025-11-05  0:46       ` Jakub Kicinski
2025-11-06 16:08         ` I Viswanath
2025-11-06 22:12           ` Jakub Kicinski
2025-11-06 20:03             ` I Viswanath
2025-10-28 17:42 ` [RFC/RFT PATCH net-next v3 2/2] net: ethernet: Implement ndo_write_rx_config callback for the 8139cp driver I Viswanath
2025-10-29  4:48 ` [RFC/RFT PATCH net-next v3 0/2] net: Split ndo_set_rx_mode into snapshot and deferred write I Viswanath

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).