* [PATCH net-next v5 01/14] net: add address list snapshot and reconciliation infrastructure
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-04 0:06 ` Jakub Kicinski
2026-04-02 22:55 ` [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work Stanislav Fomichev
` (12 subsequent siblings)
13 siblings, 1 reply; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, Aleksandr Loktionov
Introduce __hw_addr_list_snapshot() and __hw_addr_list_reconcile()
for use by the upcoming ndo_set_rx_mode_async callback.
The async rx_mode path needs to snapshot the device's unicast and
multicast address lists under the addr_lock, hand those snapshots
to the driver (which may sleep), and then propagate any sync_cnt
changes back to the real lists. Two identical snapshots are taken:
a work copy for the driver to pass to __hw_addr_sync_dev() and a
reference copy to compute deltas against.
__hw_addr_list_reconcile() walks the reference snapshot comparing
each entry against the work snapshot to determine what the driver
synced or unsynced. It then applies those deltas to the real list,
handling concurrent modifications:
- If the real entry was concurrently removed but the driver synced
it to hardware (delta > 0), re-insert a stale entry so the next
work run properly unsyncs it from hardware.
- If the entry still exists, apply the delta normally. An entry
whose refcount drops to zero is removed.
# dev_addr_test_snapshot_benchmark: 1024 addrs x 1000 snapshots: 89872802 ns total, 89872 ns/iter
# dev_addr_test_snapshot_benchmark.speed: slow
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
include/linux/netdevice.h | 7 +
net/core/dev.h | 1 +
net/core/dev_addr_lists.c | 112 +++++++++-
net/core/dev_addr_lists_test.c | 363 ++++++++++++++++++++++++++++++++-
4 files changed, 480 insertions(+), 3 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e15367373f7c..151d6f4fd9b3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4995,6 +4995,13 @@ void __hw_addr_unsync_dev(struct netdev_hw_addr_list *list,
int (*unsync)(struct net_device *,
const unsigned char *));
void __hw_addr_init(struct netdev_hw_addr_list *list);
+void __hw_addr_flush(struct netdev_hw_addr_list *list);
+int __hw_addr_list_snapshot(struct netdev_hw_addr_list *snap,
+ const struct netdev_hw_addr_list *list,
+ int addr_len);
+void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list,
+ struct netdev_hw_addr_list *work,
+ struct netdev_hw_addr_list *ref, int addr_len);
/* Functions used for device addresses handling */
void dev_addr_mod(struct net_device *dev, unsigned int offset,
diff --git a/net/core/dev.h b/net/core/dev.h
index 781619e76b3e..acc925b7b337 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -69,6 +69,7 @@ void linkwatch_run_queue(void);
void dev_addr_flush(struct net_device *dev);
int dev_addr_init(struct net_device *dev);
void dev_addr_check(struct net_device *dev);
+void __hw_addr_flush(struct netdev_hw_addr_list *list);
#if IS_ENABLED(CONFIG_NET_SHAPER)
void net_shaper_flush_netdev(struct net_device *dev);
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 76c91f224886..284fcfe9389b 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -11,6 +11,7 @@
#include <linux/rtnetlink.h>
#include <linux/export.h>
#include <linux/list.h>
+#include <kunit/visibility.h>
#include "dev.h"
@@ -481,7 +482,7 @@ void __hw_addr_unsync_dev(struct netdev_hw_addr_list *list,
}
EXPORT_SYMBOL(__hw_addr_unsync_dev);
-static void __hw_addr_flush(struct netdev_hw_addr_list *list)
+void __hw_addr_flush(struct netdev_hw_addr_list *list)
{
struct netdev_hw_addr *ha, *tmp;
@@ -492,6 +493,7 @@ static void __hw_addr_flush(struct netdev_hw_addr_list *list)
}
list->count = 0;
}
+EXPORT_SYMBOL_IF_KUNIT(__hw_addr_flush);
void __hw_addr_init(struct netdev_hw_addr_list *list)
{
@@ -501,6 +503,114 @@ void __hw_addr_init(struct netdev_hw_addr_list *list)
}
EXPORT_SYMBOL(__hw_addr_init);
+/**
+ * __hw_addr_list_snapshot - create a snapshot copy of an address list
+ * @snap: destination snapshot list (needs to be __hw_addr_init-initialized)
+ * @list: source address list to snapshot
+ * @addr_len: length of addresses
+ *
+ * Creates a copy of @list with individually allocated entries suitable
+ * for use with __hw_addr_sync_dev() and other list manipulation helpers.
+ * Each entry is allocated with GFP_ATOMIC; must be called under a spinlock.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __hw_addr_list_snapshot(struct netdev_hw_addr_list *snap,
+ const struct netdev_hw_addr_list *list,
+ int addr_len)
+{
+ struct netdev_hw_addr *ha, *entry;
+
+ list_for_each_entry(ha, &list->list, list) {
+ entry = __hw_addr_create(ha->addr, addr_len, ha->type,
+ false, false);
+ if (!entry) {
+ __hw_addr_flush(snap);
+ return -ENOMEM;
+ }
+ entry->sync_cnt = ha->sync_cnt;
+ entry->refcount = ha->refcount;
+
+ list_add_tail(&entry->list, &snap->list);
+ __hw_addr_insert(snap, entry, addr_len);
+ snap->count++;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_IF_KUNIT(__hw_addr_list_snapshot);
+
+/**
+ * __hw_addr_list_reconcile - sync snapshot changes back and free snapshots
+ * @real_list: the real address list to update
+ * @work: the working snapshot (modified by driver via __hw_addr_sync_dev)
+ * @ref: the reference snapshot (untouched copy of original state)
+ * @addr_len: length of addresses
+ *
+ * Walks the reference snapshot and compares each entry against the work
+ * snapshot to compute sync_cnt deltas. Applies those deltas to @real_list.
+ * Frees both snapshots when done.
+ * Caller must hold netif_addr_lock_bh.
+ */
+void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list,
+ struct netdev_hw_addr_list *work,
+ struct netdev_hw_addr_list *ref, int addr_len)
+{
+ struct netdev_hw_addr *ref_ha, *work_ha, *real_ha;
+ int delta;
+
+ list_for_each_entry(ref_ha, &ref->list, list) {
+ work_ha = __hw_addr_lookup(work, ref_ha->addr, addr_len,
+ ref_ha->type);
+ if (work_ha)
+ delta = work_ha->sync_cnt - ref_ha->sync_cnt;
+ else
+ delta = -1;
+
+ if (delta == 0)
+ continue;
+
+ real_ha = __hw_addr_lookup(real_list, ref_ha->addr, addr_len,
+ ref_ha->type);
+ if (!real_ha) {
+ /* The real entry was concurrently removed. If the
+ * driver synced this addr to hardware (delta > 0),
+ * re-insert it as a stale entry so the next work
+ * run unsyncs it from hardware.
+ */
+ if (delta > 0) {
+ real_ha = __hw_addr_create(ref_ha->addr,
+ addr_len,
+ ref_ha->type, false,
+ false);
+ if (real_ha) {
+ real_ha->sync_cnt = 1;
+ real_ha->refcount = 1;
+ list_add_tail_rcu(&real_ha->list,
+ &real_list->list);
+ __hw_addr_insert(real_list, real_ha,
+ addr_len);
+ real_list->count++;
+ }
+ }
+ continue;
+ }
+
+ real_ha->sync_cnt += delta;
+ real_ha->refcount += delta;
+ if (!real_ha->refcount) {
+ rb_erase(&real_ha->node, &real_list->tree);
+ list_del_rcu(&real_ha->list);
+ kfree_rcu(real_ha, rcu_head);
+ real_list->count--;
+ }
+ }
+
+ __hw_addr_flush(work);
+ __hw_addr_flush(ref);
+}
+EXPORT_SYMBOL_IF_KUNIT(__hw_addr_list_reconcile);
+
/*
* Device addresses handling functions
*/
diff --git a/net/core/dev_addr_lists_test.c b/net/core/dev_addr_lists_test.c
index 8e1dba825e94..ba8578e7845a 100644
--- a/net/core/dev_addr_lists_test.c
+++ b/net/core/dev_addr_lists_test.c
@@ -2,22 +2,31 @@
#include <kunit/test.h>
#include <linux/etherdevice.h>
+#include <linux/math64.h>
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>
static const struct net_device_ops dummy_netdev_ops = {
};
+#define ADDR_A 1
+#define ADDR_B 2
+#define ADDR_C 3
+
struct dev_addr_test_priv {
u32 addr_seen;
+ u32 addr_synced;
+ u32 addr_unsynced;
};
static int dev_addr_test_sync(struct net_device *netdev, const unsigned char *a)
{
struct dev_addr_test_priv *datp = netdev_priv(netdev);
- if (a[0] < 31 && !memchr_inv(a, a[0], ETH_ALEN))
+ if (a[0] < 31 && !memchr_inv(a, a[0], ETH_ALEN)) {
datp->addr_seen |= 1 << a[0];
+ datp->addr_synced |= 1 << a[0];
+ }
return 0;
}
@@ -26,11 +35,22 @@ static int dev_addr_test_unsync(struct net_device *netdev,
{
struct dev_addr_test_priv *datp = netdev_priv(netdev);
- if (a[0] < 31 && !memchr_inv(a, a[0], ETH_ALEN))
+ if (a[0] < 31 && !memchr_inv(a, a[0], ETH_ALEN)) {
datp->addr_seen &= ~(1 << a[0]);
+ datp->addr_unsynced |= 1 << a[0];
+ }
return 0;
}
+static void dev_addr_test_reset(struct net_device *netdev)
+{
+ struct dev_addr_test_priv *datp = netdev_priv(netdev);
+
+ datp->addr_seen = 0;
+ datp->addr_synced = 0;
+ datp->addr_unsynced = 0;
+}
+
static int dev_addr_test_init(struct kunit *test)
{
struct dev_addr_test_priv *datp;
@@ -225,6 +245,339 @@ static void dev_addr_test_add_excl(struct kunit *test)
rtnl_unlock();
}
+/* Snapshot test: basic sync with no concurrent modifications.
+ * Add one address, snapshot, driver syncs it, reconcile propagates
+ * sync_cnt delta back to real list.
+ */
+static void dev_addr_test_snapshot_sync(struct kunit *test)
+{
+ struct net_device *netdev = test->priv;
+ struct netdev_hw_addr_list snap, ref;
+ struct dev_addr_test_priv *datp;
+ struct netdev_hw_addr *ha;
+ u8 addr[ETH_ALEN];
+
+ datp = netdev_priv(netdev);
+
+ rtnl_lock();
+
+ memset(addr, ADDR_A, sizeof(addr));
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr));
+
+ /* Snapshot: ADDR_A has sync_cnt=0, refcount=1 (new) */
+ netif_addr_lock_bh(netdev);
+ __hw_addr_init(&snap);
+ __hw_addr_init(&ref);
+ KUNIT_ASSERT_EQ(test, 0,
+ __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
+ KUNIT_ASSERT_EQ(test, 0,
+ __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
+ netif_addr_unlock_bh(netdev);
+
+ /* Driver syncs ADDR_A to hardware */
+ dev_addr_test_reset(netdev);
+ __hw_addr_sync_dev(&snap, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+ KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced);
+
+ /* Reconcile: delta=+1 applied to real entry */
+ netif_addr_lock_bh(netdev);
+ __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN);
+ netif_addr_unlock_bh(netdev);
+
+ /* Real entry should now reflect the sync: sync_cnt=1, refcount=2 */
+ KUNIT_EXPECT_EQ(test, 1, netdev->uc.count);
+ ha = list_first_entry(&netdev->uc.list, struct netdev_hw_addr, list);
+ KUNIT_EXPECT_MEMEQ(test, ha->addr, addr, ETH_ALEN);
+ KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt);
+ KUNIT_EXPECT_EQ(test, 2, ha->refcount);
+
+ /* Second work run: already synced, nothing to do */
+ dev_addr_test_reset(netdev);
+ __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_synced);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced);
+ KUNIT_EXPECT_EQ(test, 1, netdev->uc.count);
+
+ rtnl_unlock();
+}
+
+/* Snapshot test: ADDR_A synced to hardware, then concurrently removed
+ * from the real list before reconcile runs. Reconcile re-inserts ADDR_A as
+ * a stale entry so the next work run unsyncs it from hardware.
+ */
+static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test)
+{
+ struct net_device *netdev = test->priv;
+ struct netdev_hw_addr_list snap, ref;
+ struct dev_addr_test_priv *datp;
+ struct netdev_hw_addr *ha;
+ u8 addr[ETH_ALEN];
+
+ datp = netdev_priv(netdev);
+
+ rtnl_lock();
+
+ memset(addr, ADDR_A, sizeof(addr));
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr));
+
+ /* Snapshot: ADDR_A is new (sync_cnt=0, refcount=1) */
+ netif_addr_lock_bh(netdev);
+ __hw_addr_init(&snap);
+ __hw_addr_init(&ref);
+ KUNIT_ASSERT_EQ(test, 0,
+ __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
+ KUNIT_ASSERT_EQ(test, 0,
+ __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
+ netif_addr_unlock_bh(netdev);
+
+ /* Driver syncs ADDR_A to hardware */
+ dev_addr_test_reset(netdev);
+ __hw_addr_sync_dev(&snap, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+ KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced);
+
+ /* Concurrent removal: user deletes ADDR_A while driver was working */
+ memset(addr, ADDR_A, sizeof(addr));
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_del(netdev, addr));
+ KUNIT_EXPECT_EQ(test, 0, netdev->uc.count);
+
+ /* Reconcile: ADDR_A gone from real list but driver synced it,
+ * so it gets re-inserted as stale (sync_cnt=1, refcount=1).
+ */
+ netif_addr_lock_bh(netdev);
+ __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN);
+ netif_addr_unlock_bh(netdev);
+
+ KUNIT_EXPECT_EQ(test, 1, netdev->uc.count);
+ ha = list_first_entry(&netdev->uc.list, struct netdev_hw_addr, list);
+ KUNIT_EXPECT_MEMEQ(test, ha->addr, addr, ETH_ALEN);
+ KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt);
+ KUNIT_EXPECT_EQ(test, 1, ha->refcount);
+
+ /* Second work run: stale entry gets unsynced from HW and removed */
+ dev_addr_test_reset(netdev);
+ __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_synced);
+ KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_unsynced);
+ KUNIT_EXPECT_EQ(test, 0, netdev->uc.count);
+
+ rtnl_unlock();
+}
+
+/* Snapshot test: ADDR_A was stale (unsynced from hardware by driver),
+ * but concurrently re-added by the user. The re-add bumps refcount of
+ * the existing stale entry. Reconcile applies delta=-1, leaving ADDR_A
+ * as a fresh entry (sync_cnt=0, refcount=1) for the next work run.
+ */
+static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test)
+{
+ struct net_device *netdev = test->priv;
+ struct netdev_hw_addr_list snap, ref;
+ struct dev_addr_test_priv *datp;
+ struct netdev_hw_addr *ha;
+ u8 addr[ETH_ALEN];
+
+ datp = netdev_priv(netdev);
+
+ rtnl_lock();
+
+ memset(addr, ADDR_A, sizeof(addr));
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr));
+
+ /* Sync ADDR_A to hardware: sync_cnt=1, refcount=2 */
+ dev_addr_test_reset(netdev);
+ __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+ KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced);
+
+ /* User removes ADDR_A: refcount=1, sync_cnt=1 -> stale */
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_del(netdev, addr));
+
+ /* Snapshot: ADDR_A is stale (sync_cnt=1, refcount=1) */
+ netif_addr_lock_bh(netdev);
+ __hw_addr_init(&snap);
+ __hw_addr_init(&ref);
+ KUNIT_ASSERT_EQ(test, 0,
+ __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
+ KUNIT_ASSERT_EQ(test, 0,
+ __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
+ netif_addr_unlock_bh(netdev);
+
+ /* Driver unsyncs stale ADDR_A from hardware */
+ dev_addr_test_reset(netdev);
+ __hw_addr_sync_dev(&snap, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_synced);
+ KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_unsynced);
+
+ /* Concurrent: user re-adds ADDR_A. dev_uc_add finds the existing
+ * stale entry and bumps refcount from 1 -> 2. sync_cnt stays 1.
+ */
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr));
+ KUNIT_EXPECT_EQ(test, 1, netdev->uc.count);
+
+ /* Reconcile: ref sync_cnt=1 matches real sync_cnt=1, delta=-1
+ * applied. Result: sync_cnt=0, refcount=1 (fresh).
+ */
+ netif_addr_lock_bh(netdev);
+ __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN);
+ netif_addr_unlock_bh(netdev);
+
+ /* Entry survives as fresh: needs re-sync to HW */
+ KUNIT_EXPECT_EQ(test, 1, netdev->uc.count);
+ ha = list_first_entry(&netdev->uc.list, struct netdev_hw_addr, list);
+ KUNIT_EXPECT_MEMEQ(test, ha->addr, addr, ETH_ALEN);
+ KUNIT_EXPECT_EQ(test, 0, ha->sync_cnt);
+ KUNIT_EXPECT_EQ(test, 1, ha->refcount);
+
+ /* Second work run: fresh entry gets synced to HW */
+ dev_addr_test_reset(netdev);
+ __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+ KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced);
+
+ rtnl_unlock();
+}
+
+/* Snapshot test: ADDR_A is new (synced by driver), and independent ADDR_B
+ * is concurrently removed from the real list. A's sync delta propagates
+ * normally; B's absence doesn't interfere.
+ */
+static void dev_addr_test_snapshot_add_and_remove(struct kunit *test)
+{
+ struct net_device *netdev = test->priv;
+ struct netdev_hw_addr_list snap, ref;
+ struct dev_addr_test_priv *datp;
+ struct netdev_hw_addr *ha;
+ u8 addr[ETH_ALEN];
+
+ datp = netdev_priv(netdev);
+
+ rtnl_lock();
+
+ /* Add ADDR_A and ADDR_B (will be synced then removed) */
+ memset(addr, ADDR_A, sizeof(addr));
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr));
+ memset(addr, ADDR_B, sizeof(addr));
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr));
+
+ /* Sync both to hardware: sync_cnt=1, refcount=2 */
+ __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+
+ /* Add ADDR_C (new, will be synced by snapshot) */
+ memset(addr, ADDR_C, sizeof(addr));
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_add(netdev, addr));
+
+ /* Snapshot: A,B synced (sync_cnt=1,refcount=2); C new (0,1) */
+ netif_addr_lock_bh(netdev);
+ __hw_addr_init(&snap);
+ __hw_addr_init(&ref);
+ KUNIT_ASSERT_EQ(test, 0,
+ __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
+ KUNIT_ASSERT_EQ(test, 0,
+ __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
+ netif_addr_unlock_bh(netdev);
+
+ /* Driver syncs snapshot: ADDR_C is new -> synced; A,B already synced */
+ dev_addr_test_reset(netdev);
+ __hw_addr_sync_dev(&snap, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+ KUNIT_EXPECT_EQ(test, 1 << ADDR_C, datp->addr_synced);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced);
+
+ /* Concurrent: user removes addr B while driver was working */
+ memset(addr, ADDR_B, sizeof(addr));
+ KUNIT_EXPECT_EQ(test, 0, dev_uc_del(netdev, addr));
+
+ /* Reconcile: ADDR_C's delta=+1 applied to real list.
+ * ADDR_B's delta=0 (unchanged in snapshot),
+ * so nothing to apply to ADDR_B.
+ */
+ netif_addr_lock_bh(netdev);
+ __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN);
+ netif_addr_unlock_bh(netdev);
+
+ /* ADDR_A: unchanged (sync_cnt=1, refcount=2)
+ * ADDR_B: refcount went from 2->1 via dev_uc_del (still present, stale)
+ * ADDR_C: sync propagated (sync_cnt=1, refcount=2)
+ */
+ KUNIT_EXPECT_EQ(test, 3, netdev->uc.count);
+ netdev_hw_addr_list_for_each(ha, &netdev->uc) {
+ u8 id = ha->addr[0];
+
+ if (!memchr_inv(ha->addr, id, ETH_ALEN)) {
+ if (id == ADDR_A) {
+ KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt);
+ KUNIT_EXPECT_EQ(test, 2, ha->refcount);
+ } else if (id == ADDR_B) {
+ /* B: still present but now stale */
+ KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt);
+ KUNIT_EXPECT_EQ(test, 1, ha->refcount);
+ } else if (id == ADDR_C) {
+ KUNIT_EXPECT_EQ(test, 1, ha->sync_cnt);
+ KUNIT_EXPECT_EQ(test, 2, ha->refcount);
+ }
+ }
+ }
+
+ /* Second work run: ADDR_B is stale, gets unsynced and removed */
+ dev_addr_test_reset(netdev);
+ __hw_addr_sync_dev(&netdev->uc, netdev, dev_addr_test_sync,
+ dev_addr_test_unsync);
+ KUNIT_EXPECT_EQ(test, 0, datp->addr_synced);
+ KUNIT_EXPECT_EQ(test, 1 << ADDR_B, datp->addr_unsynced);
+ KUNIT_EXPECT_EQ(test, 2, netdev->uc.count);
+
+ rtnl_unlock();
+}
+
+static void dev_addr_test_snapshot_benchmark(struct kunit *test)
+{
+ struct net_device *netdev = test->priv;
+ struct netdev_hw_addr_list snap;
+ s64 duration = 0;
+ ktime_t start;
+ u8 addr[ETH_ALEN];
+ int i, iter;
+
+ rtnl_lock();
+
+ for (i = 0; i < 1024; i++) {
+ memset(addr, 0, sizeof(addr));
+ addr[0] = (i >> 8) & 0xff;
+ addr[1] = i & 0xff;
+ KUNIT_ASSERT_EQ(test, 0, dev_uc_add(netdev, addr));
+ }
+
+ for (iter = 0; iter < 1000; iter++) {
+ netif_addr_lock_bh(netdev);
+ __hw_addr_init(&snap);
+
+ start = ktime_get();
+ KUNIT_ASSERT_EQ(test, 0,
+ __hw_addr_list_snapshot(&snap, &netdev->uc,
+ ETH_ALEN));
+ duration += ktime_to_ns(ktime_sub(ktime_get(), start));
+
+ netif_addr_unlock_bh(netdev);
+ __hw_addr_flush(&snap);
+ }
+
+ kunit_info(test,
+ "1024 addrs x 1000 snapshots: %lld ns total, %lld ns/iter",
+ duration, div_s64(duration, 1000));
+
+ rtnl_unlock();
+}
+
static struct kunit_case dev_addr_test_cases[] = {
KUNIT_CASE(dev_addr_test_basic),
KUNIT_CASE(dev_addr_test_sync_one),
@@ -232,6 +585,11 @@ static struct kunit_case dev_addr_test_cases[] = {
KUNIT_CASE(dev_addr_test_del_main),
KUNIT_CASE(dev_addr_test_add_set),
KUNIT_CASE(dev_addr_test_add_excl),
+ KUNIT_CASE(dev_addr_test_snapshot_sync),
+ KUNIT_CASE(dev_addr_test_snapshot_remove_during_sync),
+ KUNIT_CASE(dev_addr_test_snapshot_readd_during_unsync),
+ KUNIT_CASE(dev_addr_test_snapshot_add_and_remove),
+ KUNIT_CASE_SLOW(dev_addr_test_snapshot_benchmark),
{}
};
@@ -243,5 +601,6 @@ static struct kunit_suite dev_addr_test_suite = {
};
kunit_test_suite(dev_addr_test_suite);
+MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");
MODULE_DESCRIPTION("KUnit tests for struct netdev_hw_addr_list");
MODULE_LICENSE("GPL");
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread

* Re: [PATCH net-next v5 01/14] net: add address list snapshot and reconciliation infrastructure
2026-04-02 22:55 ` [PATCH net-next v5 01/14] net: add address list snapshot and reconciliation infrastructure Stanislav Fomichev
@ 2026-04-04 0:06 ` Jakub Kicinski
2026-04-06 22:29 ` [PATCH net-next v5 01/14] net: add address list snapshot and reconciliation infrastructure (123ac7a76378) sdf.kernel
0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2026-04-04 0:06 UTC (permalink / raw)
To: sdf; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: add address list snapshot and reconciliation infrastructure
This commit introduces address list snapshot and reconciliation routines
to support the async rx_mode path. It provides functions to snapshot the
lists, and to reconcile the hardware sync state back to the original lists
after driver processing.
> diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
> --- a/net/core/dev_addr_lists.c
> +++ b/net/core/dev_addr_lists.c
[ ... ]
> @@ -501,6 +503,114 @@ void __hw_addr_init(struct netdev_hw_addr_list *list)
[ ... ]
> +void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list,
> + struct netdev_hw_addr_list *work,
> + struct netdev_hw_addr_list *ref, int addr_len)
> +{
[ ... ]
> + real_ha = __hw_addr_lookup(real_list, ref_ha->addr, addr_len,
> + ref_ha->type);
> + if (!real_ha) {
> + /* The real entry was concurrently removed. If the
> + * driver synced this addr to hardware (delta > 0),
> + * re-insert it as a stale entry so the next work
> + * run unsyncs it from hardware.
> + */
> + if (delta > 0) {
> + real_ha = __hw_addr_create(ref_ha->addr,
> + addr_len,
> + ref_ha->type, false,
> + false);
Does this code silently leak a hardware MAC filter if the GFP_ATOMIC allocation
fails? If the software loses track of this address, it appears future sync runs
will never unsync it from the hardware.
Could this failure be avoided by unlinking ref_ha from the ref list (which is
about to be unconditionally freed via __hw_addr_flush) and moving it directly
into real_list?
> + if (real_ha) {
> + real_ha->sync_cnt = 1;
> + real_ha->refcount = 1;
Does this code break compatibility with reference-counted sync APIs?
While __hw_addr_sync_dev() increments by 1, drivers processing snapshots with
__hw_addr_ref_sync_dev() might increment sync_cnt by the total number of
references. If such an entry is concurrently removed, recreation with a
hardcoded value of 1 could instruct the next async run to unsync only 1
reference, permanently leaking the remaining hardware references.
Should these be dynamically initialized to real_ha->sync_cnt = delta and
real_ha->refcount = delta?
> + list_add_tail_rcu(&real_ha->list,
> + &real_list->list);
> + __hw_addr_insert(real_list, real_ha,
> + addr_len);
> + real_list->count++;
> + }
> + }
> + continue;
> + }
[ ... ]
> diff --git a/net/core/dev_addr_lists_test.c b/net/core/dev_addr_lists_test.c
> --- a/net/core/dev_addr_lists_test.c
> +++ b/net/core/dev_addr_lists_test.c
[ ... ]
> @@ -225,6 +245,339 @@ static void dev_addr_test_add_excl(struct kunit *test)
[ ... ]
> +static void dev_addr_test_snapshot_sync(struct kunit *test)
> +{
[ ... ]
> + netif_addr_lock_bh(netdev);
> + __hw_addr_init(&snap);
> + __hw_addr_init(&ref);
> + KUNIT_ASSERT_EQ(test, 0,
> + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
> + KUNIT_ASSERT_EQ(test, 0,
> + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
> + netif_addr_unlock_bh(netdev);
Will this cause a system deadlock if the assertion fails?
If KUNIT_ASSERT_EQ triggers (e.g. from an out-of-memory condition in
__hw_addr_list_snapshot), it immediately aborts the test case thread and skips
the rest of the function, leaving the netif_addr_lock_bh spinlock permanently
held and bottom-halves disabled.
Could this be replaced with KUNIT_EXPECT_EQ and explicit goto error handling
to ensure netif_addr_unlock_bh is always reached?
[ ... ]
> +static void dev_addr_test_snapshot_benchmark(struct kunit *test)
> +{
[ ... ]
> + for (iter = 0; iter < 1000; iter++) {
> + netif_addr_lock_bh(netdev);
> + __hw_addr_init(&snap);
> +
> + start = ktime_get();
> + KUNIT_ASSERT_EQ(test, 0,
> + __hw_addr_list_snapshot(&snap, &netdev->uc,
> + ETH_ALEN));
> + duration += ktime_to_ns(ktime_sub(ktime_get(), start));
> +
> + netif_addr_unlock_bh(netdev);
Will this also cause a deadlock if the assertion fails while the lock is held?
--
pw-bot: cr
^ permalink raw reply [flat|nested] 33+ messages in thread

* Re: [PATCH net-next v5 01/14] net: add address list snapshot and reconciliation infrastructure (123ac7a76378)
2026-04-04 0:06 ` Jakub Kicinski
@ 2026-04-06 22:29 ` sdf.kernel
0 siblings, 0 replies; 33+ messages in thread
From: sdf.kernel @ 2026-04-06 22:29 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: add address list snapshot and reconciliation infrastructure
>
> This commit introduces address list snapshot and reconciliation routines
> to support the async rx_mode path. It provides functions to snapshot the
> lists, and to reconcile the hardware sync state back to the original lists
> after driver processing.
>
> > diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
> > --- a/net/core/dev_addr_lists.c
> > +++ b/net/core/dev_addr_lists.c
> [ ... ]
> > @@ -501,6 +503,114 @@ void __hw_addr_init(struct netdev_hw_addr_list *list)
> [ ... ]
> > +void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list,
> > + struct netdev_hw_addr_list *work,
> > + struct netdev_hw_addr_list *ref, int addr_len)
> > +{
> [ ... ]
> > + real_ha = __hw_addr_lookup(real_list, ref_ha->addr, addr_len,
> > + ref_ha->type);
> > + if (!real_ha) {
> > + /* The real entry was concurrently removed. If the
> > + * driver synced this addr to hardware (delta > 0),
> > + * re-insert it as a stale entry so the next work
> > + * run unsyncs it from hardware.
> > + */
> > + if (delta > 0) {
> > + real_ha = __hw_addr_create(ref_ha->addr,
> > + addr_len,
> > + ref_ha->type, false,
> > + false);
>
> Does this code silently leak a hardware MAC filter if the GFP_ATOMIC allocation
> fails? If the software loses track of this address, it appears future sync runs
> will never unsync it from the hardware.
>
> Could this failure be avoided by unlinking ref_ha from the ref list (which is
> about to be unconditionally freed via __hw_addr_flush) and moving it directly
> into real_list?
Will unlink the ref_ha from ref list and relink, that should work.
>
> > + if (real_ha) {
> > + real_ha->sync_cnt = 1;
> > + real_ha->refcount = 1;
>
> Does this code break compatibility with reference-counted sync APIs?
>
> While __hw_addr_sync_dev() increments by 1, drivers processing snapshots with
> __hw_addr_ref_sync_dev() might increment sync_cnt by the total number of
> references. If such an entry is concurrently removed, recreation with a
> hardcoded value of 1 could instruct the next async run to unsync only 1
> reference, permanently leaking the remaining hardware references.
>
> Should these be dynamically initialized to real_ha->sync_cnt = delta and
> real_ha->refcount = delta?
Idk, I did convince myself that the delta can be only 1 here and felt
like it's easier to follow, but will switch to sync_cnt=delta.
>
> > + list_add_tail_rcu(&real_ha->list,
> > + &real_list->list);
> > + __hw_addr_insert(real_list, real_ha,
> > + addr_len);
> > + real_list->count++;
> > + }
> > + }
> > + continue;
> > + }
> [ ... ]
> > diff --git a/net/core/dev_addr_lists_test.c b/net/core/dev_addr_lists_test.c
> > --- a/net/core/dev_addr_lists_test.c
> > +++ b/net/core/dev_addr_lists_test.c
> [ ... ]
> > @@ -225,6 +245,339 @@ static void dev_addr_test_add_excl(struct kunit *test)
> [ ... ]
> > +static void dev_addr_test_snapshot_sync(struct kunit *test)
> > +{
> [ ... ]
> > + netif_addr_lock_bh(netdev);
> > + __hw_addr_init(&snap);
> > + __hw_addr_init(&ref);
> > + KUNIT_ASSERT_EQ(test, 0,
> > + __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
> > + KUNIT_ASSERT_EQ(test, 0,
> > + __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
> > + netif_addr_unlock_bh(netdev);
>
> Will this cause a system deadlock if the assertion fails?
>
> If KUNIT_ASSERT_EQ triggers (e.g. from an out-of-memory condition in
> __hw_addr_list_snapshot), it immediately aborts the test case thread and skips
> the rest of the function, leaving the netif_addr_lock_bh spinlock permanently
> held and bottom-halves disabled.
>
> Could this be replaced with KUNIT_EXPECT_EQ and explicit goto error handling
> to ensure netif_addr_unlock_bh is always reached?
Will do.
>
> [ ... ]
> > +static void dev_addr_test_snapshot_benchmark(struct kunit *test)
> > +{
> [ ... ]
> > + for (iter = 0; iter < 1000; iter++) {
> > + netif_addr_lock_bh(netdev);
> > + __hw_addr_init(&snap);
> > +
> > + start = ktime_get();
> > + KUNIT_ASSERT_EQ(test, 0,
> > + __hw_addr_list_snapshot(&snap, &netdev->uc,
> > + ETH_ALEN));
> > + duration += ktime_to_ns(ktime_sub(ktime_get(), start));
> > +
> > + netif_addr_unlock_bh(netdev);
>
> Will this also cause a deadlock if the assertion fails while the lock is held?
Same here, will do.
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
2026-04-02 22:55 ` [PATCH net-next v5 01/14] net: add address list snapshot and reconciliation infrastructure Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-04 0:06 ` Jakub Kicinski
2026-04-04 0:27 ` [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work Jakub Kicinski
2026-04-02 22:55 ` [PATCH net-next v5 03/14] net: move promiscuity handling into netdev_rx_mode_work Stanislav Fomichev
` (11 subsequent siblings)
13 siblings, 2 replies; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni
Add ndo_set_rx_mode_async callback that drivers can implement instead
of the legacy ndo_set_rx_mode. The legacy callback runs under the
netif_addr_lock spinlock with BHs disabled, preventing drivers from
sleeping. The async variant runs from a work queue with rtnl_lock and
netdev_lock_ops held, in fully sleepable context.
When __dev_set_rx_mode() sees ndo_set_rx_mode_async, it schedules
netdev_rx_mode_work instead of calling the driver inline. The work
function takes two snapshots of each address list (uc/mc) under
the addr_lock, then drops the lock and calls the driver with the
work copies. After the driver returns, it reconciles the snapshots
back to the real lists under the lock.
Add netif_rx_mode_sync() to opportunistically execute the pending
workqueue update inline, so that rx mode changes are committed
before returning to userspace:
- dev_change_flags (SIOCSIFFLAGS / RTM_NEWLINK)
- dev_set_promiscuity
- dev_set_allmulti
- dev_ifsioc SIOCADDMULTI / SIOCDELMULTI
- do_setlink (RTM_SETLINK)
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
Documentation/networking/netdevices.rst | 9 ++
include/linux/netdevice.h | 16 ++
net/core/dev.c | 46 +-----
net/core/dev.h | 3 +
net/core/dev_addr_lists.c | 194 ++++++++++++++++++++++++
net/core/dev_api.c | 3 +
net/core/dev_ioctl.c | 6 +-
net/core/rtnetlink.c | 1 +
8 files changed, 235 insertions(+), 43 deletions(-)
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index 35704d115312..8a488c21fd7c 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -289,6 +289,15 @@ struct net_device synchronization rules
ndo_set_rx_mode:
Synchronization: netif_addr_lock spinlock.
Context: BHs disabled
+ Notes: Deprecated in favor of ndo_set_rx_mode_async which runs
+ in process context.
+
+ndo_set_rx_mode_async:
+ Synchronization: rtnl_lock() semaphore. In addition, netdev instance
+ lock if the driver implements queue management or shaper API.
+ Context: process (from a work queue)
+ Notes: Async version of ndo_set_rx_mode which runs in process
+ context. Receives snapshots of the unicast and multicast address lists.
ndo_setup_tc:
``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` are running under NFT locks
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 151d6f4fd9b3..cf3dd3f1eb0f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1119,6 +1119,16 @@ struct netdev_net_notifier {
* This function is called device changes address list filtering.
* If driver handles unicast address filtering, it should set
* IFF_UNICAST_FLT in its priv_flags.
+ * Cannot sleep, called with netif_addr_lock_bh held.
+ * Deprecated in favor of ndo_set_rx_mode_async.
+ *
+ * void (*ndo_set_rx_mode_async)(struct net_device *dev,
+ * struct netdev_hw_addr_list *uc,
+ * struct netdev_hw_addr_list *mc);
+ * Async version of ndo_set_rx_mode which runs in process context
+ * with rtnl_lock and netdev_lock_ops(dev) held. The uc/mc parameters
+ * are snapshots of the address lists - iterate with
+ * netdev_hw_addr_list_for_each(ha, uc).
*
* int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
* This function is called when the Media Access Control address
@@ -1439,6 +1449,10 @@ struct net_device_ops {
void (*ndo_change_rx_flags)(struct net_device *dev,
int flags);
void (*ndo_set_rx_mode)(struct net_device *dev);
+ void (*ndo_set_rx_mode_async)(
+ struct net_device *dev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc);
int (*ndo_set_mac_address)(struct net_device *dev,
void *addr);
int (*ndo_validate_addr)(struct net_device *dev);
@@ -1903,6 +1917,7 @@ enum netdev_reg_state {
* has been enabled due to the need to listen to
* additional unicast addresses in a device that
* does not implement ndo_set_rx_mode()
+ * @rx_mode_node: List entry for rx_mode work processing
* @uc: unicast mac addresses
* @mc: multicast mac addresses
* @dev_addrs: list of device hw addresses
@@ -2294,6 +2309,7 @@ struct net_device {
unsigned int promiscuity;
unsigned int allmulti;
bool uc_promisc;
+ struct list_head rx_mode_node;
#ifdef CONFIG_LOCKDEP
unsigned char nested_level;
#endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 90d041390e29..b9f9e6a30963 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9574,7 +9574,7 @@ static void dev_change_rx_flags(struct net_device *dev, int flags)
ops->ndo_change_rx_flags(dev, flags);
}
-static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
+int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
{
unsigned int old_flags = dev->flags;
unsigned int promiscuity, flags;
@@ -9678,46 +9678,6 @@ int netif_set_allmulti(struct net_device *dev, int inc, bool notify)
return 0;
}
-/*
- * Upload unicast and multicast address lists to device and
- * configure RX filtering. When the device doesn't support unicast
- * filtering it is put in promiscuous mode while unicast addresses
- * are present.
- */
-void __dev_set_rx_mode(struct net_device *dev)
-{
- const struct net_device_ops *ops = dev->netdev_ops;
-
- /* dev_open will call this function so the list will stay sane. */
- if (!(dev->flags&IFF_UP))
- return;
-
- if (!netif_device_present(dev))
- return;
-
- if (!(dev->priv_flags & IFF_UNICAST_FLT)) {
- /* Unicast addresses changes may only happen under the rtnl,
- * therefore calling __dev_set_promiscuity here is safe.
- */
- if (!netdev_uc_empty(dev) && !dev->uc_promisc) {
- __dev_set_promiscuity(dev, 1, false);
- dev->uc_promisc = true;
- } else if (netdev_uc_empty(dev) && dev->uc_promisc) {
- __dev_set_promiscuity(dev, -1, false);
- dev->uc_promisc = false;
- }
- }
-
- if (ops->ndo_set_rx_mode)
- ops->ndo_set_rx_mode(dev);
-}
-
-void dev_set_rx_mode(struct net_device *dev)
-{
- netif_addr_lock_bh(dev);
- __dev_set_rx_mode(dev);
- netif_addr_unlock_bh(dev);
-}
/**
* netif_get_flags() - get flags reported to userspace
@@ -12108,6 +12068,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
#endif
mutex_init(&dev->lock);
+ INIT_LIST_HEAD(&dev->rx_mode_node);
dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM;
setup(dev);
@@ -12212,6 +12173,9 @@ void free_netdev(struct net_device *dev)
kfree(rcu_dereference_protected(dev->ingress_queue, 1));
+ if (netif_rx_mode_clean(dev))
+ dev_put(dev);
+
/* Flush device addresses */
dev_addr_flush(dev);
diff --git a/net/core/dev.h b/net/core/dev.h
index acc925b7b337..50edb380ca94 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -146,6 +146,9 @@ int netif_change_carrier(struct net_device *dev, bool new_carrier);
int dev_change_carrier(struct net_device *dev, bool new_carrier);
void __dev_set_rx_mode(struct net_device *dev);
+int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify);
+bool netif_rx_mode_clean(struct net_device *dev);
+void netif_rx_mode_sync(struct net_device *dev);
void __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
unsigned int gchanges, u32 portid,
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 284fcfe9389b..a05db3ac43ea 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -11,10 +11,18 @@
#include <linux/rtnetlink.h>
#include <linux/export.h>
#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
#include <kunit/visibility.h>
#include "dev.h"
+static void netdev_rx_mode_work(struct work_struct *work);
+
+static LIST_HEAD(rx_mode_list);
+static DEFINE_SPINLOCK(rx_mode_lock);
+static DECLARE_WORK(rx_mode_work, netdev_rx_mode_work);
+
/*
* General list handling functions
*/
@@ -1159,3 +1167,189 @@ void dev_mc_init(struct net_device *dev)
__hw_addr_init(&dev->mc);
}
EXPORT_SYMBOL(dev_mc_init);
+
+static int netif_addr_lists_snapshot(struct net_device *dev,
+ struct netdev_hw_addr_list *uc_snap,
+ struct netdev_hw_addr_list *mc_snap,
+ struct netdev_hw_addr_list *uc_ref,
+ struct netdev_hw_addr_list *mc_ref)
+{
+ int err;
+
+ err = __hw_addr_list_snapshot(uc_snap, &dev->uc, dev->addr_len);
+ if (!err)
+ err = __hw_addr_list_snapshot(uc_ref, &dev->uc, dev->addr_len);
+ if (!err)
+ err = __hw_addr_list_snapshot(mc_snap, &dev->mc,
+ dev->addr_len);
+ if (!err)
+ err = __hw_addr_list_snapshot(mc_ref, &dev->mc, dev->addr_len);
+
+ if (err) {
+ __hw_addr_flush(uc_snap);
+ __hw_addr_flush(uc_ref);
+ __hw_addr_flush(mc_snap);
+ }
+
+ return err;
+}
+
+static void netif_addr_lists_reconcile(struct net_device *dev,
+ struct netdev_hw_addr_list *uc_snap,
+ struct netdev_hw_addr_list *mc_snap,
+ struct netdev_hw_addr_list *uc_ref,
+ struct netdev_hw_addr_list *mc_ref)
+{
+ __hw_addr_list_reconcile(&dev->uc, uc_snap, uc_ref, dev->addr_len);
+ __hw_addr_list_reconcile(&dev->mc, mc_snap, mc_ref, dev->addr_len);
+}
+
+static void netif_rx_mode_run(struct net_device *dev)
+{
+ struct netdev_hw_addr_list uc_snap, mc_snap, uc_ref, mc_ref;
+ const struct net_device_ops *ops = dev->netdev_ops;
+ int err;
+
+ might_sleep();
+ netdev_ops_assert_locked(dev);
+
+ __hw_addr_init(&uc_snap);
+ __hw_addr_init(&mc_snap);
+ __hw_addr_init(&uc_ref);
+ __hw_addr_init(&mc_ref);
+
+ if (!(dev->flags & IFF_UP) || !netif_device_present(dev))
+ return;
+
+ netif_addr_lock_bh(dev);
+ err = netif_addr_lists_snapshot(dev, &uc_snap, &mc_snap,
+ &uc_ref, &mc_ref);
+ if (err) {
+ netdev_WARN(dev, "failed to sync uc/mc addresses\n");
+ netif_addr_unlock_bh(dev);
+ return;
+ }
+ netif_addr_unlock_bh(dev);
+
+ ops->ndo_set_rx_mode_async(dev, &uc_snap, &mc_snap);
+
+ netif_addr_lock_bh(dev);
+ netif_addr_lists_reconcile(dev, &uc_snap, &mc_snap,
+ &uc_ref, &mc_ref);
+ netif_addr_unlock_bh(dev);
+}
+
+static void netdev_rx_mode_work(struct work_struct *work)
+{
+ struct net_device *dev;
+
+ rtnl_lock();
+
+ while (true) {
+ spin_lock_bh(&rx_mode_lock);
+ if (list_empty(&rx_mode_list)) {
+ spin_unlock_bh(&rx_mode_lock);
+ break;
+ }
+ dev = list_first_entry(&rx_mode_list, struct net_device,
+ rx_mode_node);
+ list_del_init(&dev->rx_mode_node);
+ spin_unlock_bh(&rx_mode_lock);
+
+ netdev_lock_ops(dev);
+ netif_rx_mode_run(dev);
+ netdev_unlock_ops(dev);
+ dev_put(dev);
+ }
+
+ rtnl_unlock();
+}
+
+static void netif_rx_mode_queue(struct net_device *dev)
+{
+ spin_lock_bh(&rx_mode_lock);
+ if (list_empty(&dev->rx_mode_node)) {
+ list_add_tail(&dev->rx_mode_node, &rx_mode_list);
+ dev_hold(dev);
+ }
+ spin_unlock_bh(&rx_mode_lock);
+ schedule_work(&rx_mode_work);
+}
+
+/**
+ * __dev_set_rx_mode() - upload unicast and multicast address lists to device
+ * and configure RX filtering.
+ * @dev: device
+ *
+ * When the device doesn't support unicast filtering it is put in promiscuous
+ * mode while unicast addresses are present.
+ */
+void __dev_set_rx_mode(struct net_device *dev)
+{
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ /* dev_open will call this function so the list will stay sane. */
+ if (!(dev->flags & IFF_UP))
+ return;
+
+ if (!netif_device_present(dev))
+ return;
+
+ if (ops->ndo_set_rx_mode_async) {
+ netif_rx_mode_queue(dev);
+ return;
+ }
+
+ if (!(dev->priv_flags & IFF_UNICAST_FLT)) {
+ if (!netdev_uc_empty(dev) && !dev->uc_promisc) {
+ __dev_set_promiscuity(dev, 1, false);
+ dev->uc_promisc = true;
+ } else if (netdev_uc_empty(dev) && dev->uc_promisc) {
+ __dev_set_promiscuity(dev, -1, false);
+ dev->uc_promisc = false;
+ }
+ }
+
+ if (ops->ndo_set_rx_mode)
+ ops->ndo_set_rx_mode(dev);
+}
+
+void dev_set_rx_mode(struct net_device *dev)
+{
+ netif_addr_lock_bh(dev);
+ __dev_set_rx_mode(dev);
+ netif_addr_unlock_bh(dev);
+}
+
+bool netif_rx_mode_clean(struct net_device *dev)
+{
+ bool clean = false;
+
+ spin_lock_bh(&rx_mode_lock);
+ if (!list_empty(&dev->rx_mode_node)) {
+ list_del_init(&dev->rx_mode_node);
+ clean = true;
+ }
+ spin_unlock_bh(&rx_mode_lock);
+
+ return clean;
+}
+
+/**
+ * netif_rx_mode_sync() - sync rx mode inline
+ * @dev: network device
+ *
+ * Drivers implementing ndo_set_rx_mode_async() have their rx mode callback
+ * executed from a workqueue. This allows the callback to sleep, but means
+ * the hardware update is deferred and may not be visible to userspace
+ * by the time the initiating syscall returns. netif_rx_mode_sync() steals
+ * the pending workqueue update and executes it inline. This preserves the
+ * atomicity of operations as observed by userspace.
+ */
+void netif_rx_mode_sync(struct net_device *dev)
+{
+ if (netif_rx_mode_clean(dev)) {
+ netif_rx_mode_run(dev);
+ dev_put(dev);
+ }
+}
diff --git a/net/core/dev_api.c b/net/core/dev_api.c
index f28852078aa6..437947dd08ed 100644
--- a/net/core/dev_api.c
+++ b/net/core/dev_api.c
@@ -66,6 +66,7 @@ int dev_change_flags(struct net_device *dev, unsigned int flags,
netdev_lock_ops(dev);
ret = netif_change_flags(dev, flags, extack);
+ netif_rx_mode_sync(dev);
netdev_unlock_ops(dev);
return ret;
@@ -285,6 +286,7 @@ int dev_set_promiscuity(struct net_device *dev, int inc)
netdev_lock_ops(dev);
ret = netif_set_promiscuity(dev, inc);
+ netif_rx_mode_sync(dev);
netdev_unlock_ops(dev);
return ret;
@@ -311,6 +313,7 @@ int dev_set_allmulti(struct net_device *dev, int inc)
netdev_lock_ops(dev);
ret = netif_set_allmulti(dev, inc, true);
+ netif_rx_mode_sync(dev);
netdev_unlock_ops(dev);
return ret;
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 7a8966544c9d..f3979b276090 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -586,24 +586,26 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, void __user *data,
return err;
case SIOCADDMULTI:
- if (!ops->ndo_set_rx_mode ||
+ if ((!ops->ndo_set_rx_mode && !ops->ndo_set_rx_mode_async) ||
ifr->ifr_hwaddr.sa_family != AF_UNSPEC)
return -EINVAL;
if (!netif_device_present(dev))
return -ENODEV;
netdev_lock_ops(dev);
err = dev_mc_add_global(dev, ifr->ifr_hwaddr.sa_data);
+ netif_rx_mode_sync(dev);
netdev_unlock_ops(dev);
return err;
case SIOCDELMULTI:
- if (!ops->ndo_set_rx_mode ||
+ if ((!ops->ndo_set_rx_mode && !ops->ndo_set_rx_mode_async) ||
ifr->ifr_hwaddr.sa_family != AF_UNSPEC)
return -EINVAL;
if (!netif_device_present(dev))
return -ENODEV;
netdev_lock_ops(dev);
err = dev_mc_del_global(dev, ifr->ifr_hwaddr.sa_data);
+ netif_rx_mode_sync(dev);
netdev_unlock_ops(dev);
return err;
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index fae8034efbff..f4e5ac70709d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3431,6 +3431,7 @@ static int do_setlink(const struct sk_buff *skb, struct net_device *dev,
dev->name);
}
+ netif_rx_mode_sync(dev);
netdev_unlock_ops(dev);
return err;
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work
2026-04-02 22:55 ` [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work Stanislav Fomichev
@ 2026-04-04 0:06 ` Jakub Kicinski
2026-04-06 22:29 ` [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work (61d75e67dcd2) sdf.kernel
2026-04-04 0:27 ` [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work Jakub Kicinski
1 sibling, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2026-04-04 0:06 UTC (permalink / raw)
To: sdf; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work
This commit introduces an async variant of ndo_set_rx_mode that runs from a
workqueue, allowing drivers to sleep during RX mode changes. It also adds
netif_rx_mode_sync() to opportunistically flush pending updates before
returning to userspace.
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
[ ... ]
> @@ -12217,6 +12178,9 @@ void free_netdev(struct net_device *dev)
>
> kfree(rcu_dereference_protected(dev->ingress_queue, 1));
>
> + if (netif_rx_mode_clean(dev))
> + dev_put(dev);
> +
> /* Flush device addresses */
> dev_addr_flush(dev);
Is this cleanup reachable during normal shutdown?
When a device is queued to rx_mode_list, netif_rx_mode_queue() elevates its
reference count via dev_hold(). During unregistration, netdev_wait_allrefs()
blocks the system until all references are dropped.
Because of this wait, the system shouldn't be able to reach free_netdev()
until the worker thread has already dequeued the device and called
dev_put(). It seems the device is guaranteed to not be on the list by the
time this code executes.
> --- a/net/core/dev_addr_lists.c
> +++ b/net/core/dev_addr_lists.c
[ ... ]
> +static void netdev_rx_mode_work(struct work_struct *work)
> +{
> + struct net_device *dev;
> +
> + rtnl_lock();
> +
> + while (true) {
> + spin_lock_bh(&rx_mode_lock);
> + if (list_empty(&rx_mode_list)) {
> + spin_unlock_bh(&rx_mode_lock);
> + break;
> + }
> + dev = list_first_entry(&rx_mode_list, struct net_device,
> + rx_mode_node);
> + list_del_init(&dev->rx_mode_node);
> + spin_unlock_bh(&rx_mode_lock);
> +
> + netdev_lock_ops(dev);
Can this while loop starve the system of the rtnl lock?
Since userspace can trigger RX mode updates via ioctls like SIOCADDMULTI
while holding only netdev_lock_ops() and not the rtnl lock, a continuous
stream of these operations could repeatedly add devices to rx_mode_list.
Because the loop releases rx_mode_lock while processing but retains the rtnl
lock, could it run indefinitely if the list is continually replenished?
Additionally, does the order of list_del_init() and netdev_lock_ops() create
a race condition with netif_rx_mode_sync()?
If the worker thread dequeues the device and drops the spinlock, it might
then block waiting for netdev_lock_ops(). Meanwhile, an ioctl thread
holding netdev_lock_ops() might call netif_rx_mode_sync().
The ioctl thread would call netif_rx_mode_clean(), which checks
list_empty(). Since the worker already removed the device from the list,
clean returns false. This causes netif_rx_mode_sync() to act as a no-op,
allowing the ioctl to return to userspace before the hardware update
actually completes.
> --- a/net/core/dev_api.c
> +++ b/net/core/dev_api.c
[ ... ]
> @@ -311,6 +313,7 @@ int dev_set_allmulti(struct net_device *dev, int inc)
>
> netdev_lock_ops(dev);
> ret = netif_set_allmulti(dev, inc, true);
> + netif_rx_mode_sync(dev);
> netdev_unlock_ops(dev);
>
> return ret;
Does this correctly handle cascaded updates for stacked devices?
When operating on a stacked device like a VLAN, the RX mode update cascades
to the underlying physical device, queuing the physical device to the
rx_mode_list via __dev_set_rx_mode().
Since netif_rx_mode_sync() is only called on the stacked device here, and
the stacked device itself is not on the rx_mode_list, will the physical
device's update be left on the workqueue? This seems to bypass the
synchronous update guarantee.
^ permalink raw reply [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work (61d75e67dcd2)
2026-04-04 0:06 ` Jakub Kicinski
@ 2026-04-06 22:29 ` sdf.kernel
0 siblings, 0 replies; 33+ messages in thread
From: sdf.kernel @ 2026-04-06 22:29 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> Subject: net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work
>
> This commit introduces an async variant of ndo_set_rx_mode that runs from a
> workqueue, allowing drivers to sleep during RX mode changes. It also adds
> netif_rx_mode_sync() to opportunistically flush pending updates before
> returning to userspace.
>
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> [ ... ]
> > @@ -12217,6 +12178,9 @@ void free_netdev(struct net_device *dev)
> >
> > kfree(rcu_dereference_protected(dev->ingress_queue, 1));
> >
> > + if (netif_rx_mode_clean(dev))
> > + dev_put(dev);
> > +
> > /* Flush device addresses */
> > dev_addr_flush(dev);
>
> Is this cleanup reachable during normal shutdown?
>
> When a device is queued to rx_mode_list, netif_rx_mode_queue() elevates its
> reference count via dev_hold(). During unregistration, netdev_wait_allrefs()
> blocks the system until all references are dropped.
>
> Because of this wait, the system shouldn't be able to reach free_netdev()
> until the worker thread has already dequeued the device and called
> dev_put(). It seems the device is guaranteed to not be on the list by the
> time this code executes.
I guess, yeah, we can drop this. netdev_wait_allrefs waits for all refs
to go away, so this is not needed.
>
> > --- a/net/core/dev_addr_lists.c
> > +++ b/net/core/dev_addr_lists.c
> [ ... ]
> > +static void netdev_rx_mode_work(struct work_struct *work)
> > +{
> > + struct net_device *dev;
> > +
> > + rtnl_lock();
> > +
> > + while (true) {
> > + spin_lock_bh(&rx_mode_lock);
> > + if (list_empty(&rx_mode_list)) {
> > + spin_unlock_bh(&rx_mode_lock);
> > + break;
> > + }
> > + dev = list_first_entry(&rx_mode_list, struct net_device,
> > + rx_mode_node);
> > + list_del_init(&dev->rx_mode_node);
> > + spin_unlock_bh(&rx_mode_lock);
> > +
> > + netdev_lock_ops(dev);
>
> Can this while loop starve the system of the rtnl lock?
>
> Since userspace can trigger RX mode updates via ioctls like SIOCADDMULTI
> while holding only netdev_lock_ops() and not the rtnl lock, a continuous
> stream of these operations could repeatedly add devices to rx_mode_list.
>
> Because the loop releases rx_mode_lock while processing but retains the rtnl
> lock, could it run indefinitely if the list is continually replenished?
Idk, this seems overly paranoid; the same work would happen under rtnl
on the main thread if we didn't do an async callback.
> Additionally, does the order of list_del_init() and netdev_lock_ops() create
> a race condition with netif_rx_mode_sync()?
>
> If the worker thread dequeues the device and drops the spinlock, it might
> then block waiting for netdev_lock_ops(). Meanwhile, an ioctl thread
> holding netdev_lock_ops() might call netif_rx_mode_sync().
>
> The ioctl thread would call netif_rx_mode_clean(), which checks
> list_empty(). Since the worker already removed the device from the list,
> clean returns false. This causes netif_rx_mode_sync() to act as a no-op,
> allowing the ioctl to return to userspace before the hardware update
> actually completes.
The worst thing that can happen is that we re-queue an empty work item. I don't
see how we can return to userspace without having done a sync update.
>
> > --- a/net/core/dev_api.c
> > +++ b/net/core/dev_api.c
> [ ... ]
> > @@ -311,6 +313,7 @@ int dev_set_allmulti(struct net_device *dev, int inc)
> >
> > netdev_lock_ops(dev);
> > ret = netif_set_allmulti(dev, inc, true);
> > + netif_rx_mode_sync(dev);
> > netdev_unlock_ops(dev);
> >
> > return ret;
>
> Does this correctly handle cascaded updates for stacked devices?
>
> When operating on a stacked device like a VLAN, the RX mode update cascades
> to the underlying physical device, queuing the physical device to the
> rx_mode_list via __dev_set_rx_mode().
>
> Since netif_rx_mode_sync() is only called on the stacked device here, and
> the stacked device itself is not on the rx_mode_list, will the physical
> device's update be left on the workqueue? This seems to bypass the
> synchronous update guarantee.
Yes, this assessment is correct in general. Not sure whether we want some new
netif_rx_mode_deep_sync or (probably better?) to add netif_rx_mode_sync
calls where appropriate. For now, leaving netif_rx_mode_sync in a few
places and planning to add more netif_rx_mode_sync calls if/when issues
with deep hierarchy syncing arise.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work
2026-04-02 22:55 ` [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work Stanislav Fomichev
2026-04-04 0:06 ` Jakub Kicinski
@ 2026-04-04 0:27 ` Jakub Kicinski
2026-04-06 22:29 ` [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work (61d75e67dcd2) sdf.kernel
1 sibling, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2026-04-04 0:27 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: netdev, davem, edumazet, pabeni
On Thu, 2 Apr 2026 15:55:23 -0700 Stanislav Fomichev wrote:
> + if (netif_rx_mode_clean(dev))
> + dev_put(dev);
let's toss a netdev_tracker into this reference counting
but overall the work stealing is looking cleaner than I expected,
nicely done!
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH net-next v5 03/14] net: move promiscuity handling into netdev_rx_mode_work
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
2026-04-02 22:55 ` [PATCH net-next v5 01/14] net: add address list snapshot and reconciliation infrastructure Stanislav Fomichev
2026-04-02 22:55 ` [PATCH net-next v5 02/14] net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-04 0:06 ` Jakub Kicinski
2026-04-02 22:55 ` [PATCH net-next v5 04/14] net: cache snapshot entries for ndo_set_rx_mode_async Stanislav Fomichev
` (10 subsequent siblings)
13 siblings, 1 reply; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, Aleksandr Loktionov
Move unicast promiscuity tracking into netdev_rx_mode_work so it runs
under netdev_ops_lock instead of under the addr_lock spinlock. This
is required because __dev_set_promiscuity calls dev_change_rx_flags
and __dev_notify_flags, both of which may need to sleep.
Change ASSERT_RTNL() to netdev_ops_assert_locked() in
__dev_set_promiscuity, netif_set_allmulti and __dev_change_flags
since these are now called from the work queue under the ops lock.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
Documentation/networking/netdevices.rst | 4 ++
net/core/dev.c | 16 ++---
net/core/dev_addr_lists.c | 82 ++++++++++++++++++-------
3 files changed, 68 insertions(+), 34 deletions(-)
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index 8a488c21fd7c..390537323457 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -299,6 +299,10 @@ struct net_device synchronization rules
Notes: Async version of ndo_set_rx_mode which runs in process
context. Receives snapshots of the unicast and multicast address lists.
+ndo_change_rx_flags:
+ Synchronization: rtnl_lock() semaphore. In addition, netdev instance
+ lock if the driver implements queue management or shaper API.
+
ndo_setup_tc:
``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` are running under NFT locks
(i.e. no ``rtnl_lock`` and no device instance lock). The rest of
diff --git a/net/core/dev.c b/net/core/dev.c
index b9f9e6a30963..ae1fd30d1ef9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9581,7 +9581,7 @@ int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
kuid_t uid;
kgid_t gid;
- ASSERT_RTNL();
+ netdev_ops_assert_locked(dev);
promiscuity = dev->promiscuity + inc;
if (promiscuity == 0) {
@@ -9617,16 +9617,8 @@ int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
dev_change_rx_flags(dev, IFF_PROMISC);
}
- if (notify) {
- /* The ops lock is only required to ensure consistent locking
- * for `NETDEV_CHANGE` notifiers. This function is sometimes
- * called without the lock, even for devices that are ops
- * locked, such as in `dev_uc_sync_multiple` when using
- * bonding or teaming.
- */
- netdev_ops_assert_locked(dev);
+ if (notify)
__dev_notify_flags(dev, old_flags, IFF_PROMISC, 0, NULL);
- }
return 0;
}
@@ -9648,7 +9640,7 @@ int netif_set_allmulti(struct net_device *dev, int inc, bool notify)
unsigned int old_flags = dev->flags, old_gflags = dev->gflags;
unsigned int allmulti, flags;
- ASSERT_RTNL();
+ netdev_ops_assert_locked(dev);
allmulti = dev->allmulti + inc;
if (allmulti == 0) {
@@ -9716,7 +9708,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags,
unsigned int old_flags = dev->flags;
int ret;
- ASSERT_RTNL();
+ netdev_ops_assert_locked(dev);
/*
* Set the flags on our device.
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index a05db3ac43ea..61589537b2d3 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -1204,10 +1204,34 @@ static void netif_addr_lists_reconcile(struct net_device *dev,
__hw_addr_list_reconcile(&dev->mc, mc_snap, mc_ref, dev->addr_len);
}
+/**
+ * netif_uc_promisc_update() - evaluate whether uc_promisc should be toggled.
+ * @dev: device
+ *
+ * Must be called under netif_addr_lock_bh.
+ * Return: +1 to enter promisc, -1 to leave, 0 for no change.
+ */
+static int netif_uc_promisc_update(struct net_device *dev)
+{
+ if (dev->priv_flags & IFF_UNICAST_FLT)
+ return 0;
+
+ if (!netdev_uc_empty(dev) && !dev->uc_promisc) {
+ dev->uc_promisc = true;
+ return 1;
+ }
+ if (netdev_uc_empty(dev) && dev->uc_promisc) {
+ dev->uc_promisc = false;
+ return -1;
+ }
+ return 0;
+}
+
static void netif_rx_mode_run(struct net_device *dev)
{
struct netdev_hw_addr_list uc_snap, mc_snap, uc_ref, mc_ref;
const struct net_device_ops *ops = dev->netdev_ops;
+ int promisc_inc;
int err;
might_sleep();
@@ -1221,22 +1245,39 @@ static void netif_rx_mode_run(struct net_device *dev)
if (!(dev->flags & IFF_UP) || !netif_device_present(dev))
return;
- netif_addr_lock_bh(dev);
- err = netif_addr_lists_snapshot(dev, &uc_snap, &mc_snap,
- &uc_ref, &mc_ref);
- if (err) {
- netdev_WARN(dev, "failed to sync uc/mc addresses\n");
+ if (ops->ndo_set_rx_mode_async) {
+ netif_addr_lock_bh(dev);
+ err = netif_addr_lists_snapshot(dev, &uc_snap, &mc_snap,
+ &uc_ref, &mc_ref);
+ if (err) {
+ netdev_WARN(dev, "failed to sync uc/mc addresses\n");
+ netif_addr_unlock_bh(dev);
+ return;
+ }
+
+ promisc_inc = netif_uc_promisc_update(dev);
+ netif_addr_unlock_bh(dev);
+ } else {
+ netif_addr_lock_bh(dev);
+ promisc_inc = netif_uc_promisc_update(dev);
netif_addr_unlock_bh(dev);
- return;
}
- netif_addr_unlock_bh(dev);
- ops->ndo_set_rx_mode_async(dev, &uc_snap, &mc_snap);
+ if (promisc_inc)
+ __dev_set_promiscuity(dev, promisc_inc, false);
- netif_addr_lock_bh(dev);
- netif_addr_lists_reconcile(dev, &uc_snap, &mc_snap,
- &uc_ref, &mc_ref);
- netif_addr_unlock_bh(dev);
+ if (ops->ndo_set_rx_mode_async) {
+ ops->ndo_set_rx_mode_async(dev, &uc_snap, &mc_snap);
+
+ netif_addr_lock_bh(dev);
+ netif_addr_lists_reconcile(dev, &uc_snap, &mc_snap,
+ &uc_ref, &mc_ref);
+ netif_addr_unlock_bh(dev);
+ } else if (ops->ndo_set_rx_mode) {
+ netif_addr_lock_bh(dev);
+ ops->ndo_set_rx_mode(dev);
+ netif_addr_unlock_bh(dev);
+ }
}
static void netdev_rx_mode_work(struct work_struct *work)
@@ -1287,6 +1328,7 @@ static void netif_rx_mode_queue(struct net_device *dev)
void __dev_set_rx_mode(struct net_device *dev)
{
const struct net_device_ops *ops = dev->netdev_ops;
+ int promisc_inc;
/* dev_open will call this function so the list will stay sane. */
if (!(dev->flags & IFF_UP))
@@ -1295,20 +1337,16 @@ void __dev_set_rx_mode(struct net_device *dev)
if (!netif_device_present(dev))
return;
- if (ops->ndo_set_rx_mode_async) {
+ if (ops->ndo_set_rx_mode_async || ops->ndo_change_rx_flags) {
netif_rx_mode_queue(dev);
return;
}
- if (!(dev->priv_flags & IFF_UNICAST_FLT)) {
- if (!netdev_uc_empty(dev) && !dev->uc_promisc) {
- __dev_set_promiscuity(dev, 1, false);
- dev->uc_promisc = true;
- } else if (netdev_uc_empty(dev) && dev->uc_promisc) {
- __dev_set_promiscuity(dev, -1, false);
- dev->uc_promisc = false;
- }
- }
+ /* Legacy path for non-ops-locked HW devices. */
+
+ promisc_inc = netif_uc_promisc_update(dev);
+ if (promisc_inc)
+ __dev_set_promiscuity(dev, promisc_inc, false);
if (ops->ndo_set_rx_mode)
ops->ndo_set_rx_mode(dev);
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 03/14] net: move promiscuity handling into netdev_rx_mode_work
2026-04-02 22:55 ` [PATCH net-next v5 03/14] net: move promiscuity handling into netdev_rx_mode_work Stanislav Fomichev
@ 2026-04-04 0:06 ` Jakub Kicinski
2026-04-06 22:29 ` [PATCH net-next v5 03/14] net: move promiscuity handling into netdev_rx_mode_work (ddeab417d841) sdf.kernel
0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2026-04-04 0:06 UTC (permalink / raw)
To: sdf; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: move promiscuity handling into netdev_rx_mode_work
This commit moves unicast promiscuity tracking into netdev_rx_mode_work so
it runs under netdev_ops_lock instead of the addr_lock spinlock. It also
changes ASSERT_RTNL() to netdev_ops_assert_locked() in several functions
called from the work queue.
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0efac0c730077..08dac9c759996 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -9586,7 +9586,7 @@ int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
> kuid_t uid;
> kgid_t gid;
>
> - ASSERT_RTNL();
> + netdev_ops_assert_locked(dev);
Can this cause a lockdep assertion failure for unconverted ops-locked devices
using the legacy RX mode path?
Replacing ASSERT_RTNL() with netdev_ops_assert_locked(dev) asserts that
dev->lock is held for devices where netdev_need_ops_lock(dev) is true.
[ ... ]
> diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
> index a05db3ac43ea9..61589537b2d3d 100644
> --- a/net/core/dev_addr_lists.c
> +++ b/net/core/dev_addr_lists.c
[ ... ]
> @@ -1295,20 +1337,16 @@ void __dev_set_rx_mode(struct net_device *dev)
> if (!netif_device_present(dev))
> return;
>
> - if (ops->ndo_set_rx_mode_async) {
> + if (ops->ndo_set_rx_mode_async || ops->ndo_change_rx_flags) {
> netif_rx_mode_queue(dev);
> return;
> }
>
> - if (!(dev->priv_flags & IFF_UNICAST_FLT)) {
> - if (!netdev_uc_empty(dev) && !dev->uc_promisc) {
> - __dev_set_promiscuity(dev, 1, false);
> - dev->uc_promisc = true;
> - } else if (netdev_uc_empty(dev) && dev->uc_promisc) {
> - __dev_set_promiscuity(dev, -1, false);
> - dev->uc_promisc = false;
> - }
> - }
> + /* Legacy path for non-ops-locked HW devices. */
> +
> + promisc_inc = netif_uc_promisc_update(dev);
> + if (promisc_inc)
> + __dev_set_promiscuity(dev, promisc_inc, false);
If an ops-locked device has not yet been converted to implement the new
ndo_set_rx_mode_async or ndo_change_rx_flags callbacks, this legacy path
will be executed.
When a secondary unicast address is added to such a device (e.g., by
bringing up a macvlan interface over it) and the device lacks
IFF_UNICAST_FLT, netif_uc_promisc_update() returns 1. This triggers a
synchronous inline call to __dev_set_promiscuity(dev, 1, false).
Since __dev_set_rx_mode() is invoked from address list helpers like
dev_uc_add() which only hold rtnl_lock() and netif_addr_lock_bh(dev),
dev->lock is not held, which would trigger the new assertion in
__dev_set_promiscuity().
^ permalink raw reply [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 03/14] net: move promiscuity handling into netdev_rx_mode_work (ddeab417d841)
2026-04-04 0:06 ` Jakub Kicinski
@ 2026-04-06 22:29 ` sdf.kernel
0 siblings, 0 replies; 33+ messages in thread
From: sdf.kernel @ 2026-04-06 22:29 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: move promiscuity handling into netdev_rx_mode_work
>
> This commit moves unicast promiscuity tracking into netdev_rx_mode_work so
> it runs under netdev_ops_lock instead of the addr_lock spinlock. It also
> changes ASSERT_RTNL() to netdev_ops_assert_locked() in several functions
> called from the work queue.
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 0efac0c730077..08dac9c759996 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -9586,7 +9586,7 @@ int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
> > kuid_t uid;
> > kgid_t gid;
> >
> > - ASSERT_RTNL();
> > + netdev_ops_assert_locked(dev);
>
> Can this cause a lockdep assertion failure for unconverted ops-locked devices
> using the legacy RX mode path?
Correct and addressed later in the series in "net: warn ops-locked drivers
still using ndo_set_rx_mode". Don't think there is a better way to split
the patches...
> Replacing ASSERT_RTNL() with netdev_ops_assert_locked(dev) asserts that
> dev->lock is held for devices where netdev_need_ops_lock(dev) is true.
>
> [ ... ]
>
> > diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
> > index a05db3ac43ea9..61589537b2d3d 100644
> > --- a/net/core/dev_addr_lists.c
> > +++ b/net/core/dev_addr_lists.c
> [ ... ]
> > @@ -1295,20 +1337,16 @@ void __dev_set_rx_mode(struct net_device *dev)
> > if (!netif_device_present(dev))
> > return;
> >
> > - if (ops->ndo_set_rx_mode_async) {
> > + if (ops->ndo_set_rx_mode_async || ops->ndo_change_rx_flags) {
> > netif_rx_mode_queue(dev);
> > return;
> > }
> >
> > - if (!(dev->priv_flags & IFF_UNICAST_FLT)) {
> > - if (!netdev_uc_empty(dev) && !dev->uc_promisc) {
> > - __dev_set_promiscuity(dev, 1, false);
> > - dev->uc_promisc = true;
> > - } else if (netdev_uc_empty(dev) && dev->uc_promisc) {
> > - __dev_set_promiscuity(dev, -1, false);
> > - dev->uc_promisc = false;
> > - }
> > - }
> > + /* Legacy path for non-ops-locked HW devices. */
> > +
> > + promisc_inc = netif_uc_promisc_update(dev);
> > + if (promisc_inc)
> > + __dev_set_promiscuity(dev, promisc_inc, false);
>
> If an ops-locked device has not yet been converted to implement the new
> ndo_set_rx_mode_async or ndo_change_rx_flags callbacks, this legacy path
> will be executed.
Same here, addressed later in the series in "net: warn ops-locked drivers
still using ndo_set_rx_mode".
>
> When a secondary unicast address is added to such a device (e.g., by
> bringing up a macvlan interface over it) and the device lacks
> IFF_UNICAST_FLT, netif_uc_promisc_update() returns 1. This triggers a
> synchronous inline call to __dev_set_promiscuity(dev, 1, false).
>
> Since __dev_set_rx_mode() is invoked from address list helpers like
> dev_uc_add() which only hold rtnl_lock() and netif_addr_lock_bh(dev),
> dev->lock is not held, which would trigger the new assertion in
> __dev_set_promiscuity().
Same.
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH net-next v5 04/14] net: cache snapshot entries for ndo_set_rx_mode_async
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (2 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 03/14] net: move promiscuity handling into netdev_rx_mode_work Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-02 22:55 ` [PATCH net-next v5 05/14] fbnic: convert to ndo_set_rx_mode_async Stanislav Fomichev
` (9 subsequent siblings)
13 siblings, 0 replies; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni
Add a per-device netdev_hw_addr_list cache (rx_mode_addr_cache) that
allows __hw_addr_list_snapshot() and __hw_addr_list_reconcile() to
reuse previously allocated entries instead of hitting GFP_ATOMIC on
every snapshot cycle.
The snapshot helper pops entries from the cache when available,
falling back to __hw_addr_create(). The reconcile helper splices both
snapshot lists back into the cache via __hw_addr_splice(). The cache
is flushed in free_netdev().
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
include/linux/netdevice.h | 7 ++--
net/core/dev.c | 2 ++
net/core/dev_addr_lists.c | 66 ++++++++++++++++++++++++----------
net/core/dev_addr_lists_test.c | 60 +++++++++++++++++++++----------
4 files changed, 96 insertions(+), 39 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf3dd3f1eb0f..8f8fa72c2c64 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1918,6 +1918,7 @@ enum netdev_reg_state {
* additional unicast addresses in a device that
* does not implement ndo_set_rx_mode()
* @rx_mode_node: List entry for rx_mode work processing
+ * @rx_mode_addr_cache: Recycled snapshot entries for rx_mode work
* @uc: unicast mac addresses
* @mc: multicast mac addresses
* @dev_addrs: list of device hw addresses
@@ -2310,6 +2311,7 @@ struct net_device {
unsigned int allmulti;
bool uc_promisc;
struct list_head rx_mode_node;
+ struct netdev_hw_addr_list rx_mode_addr_cache;
#ifdef CONFIG_LOCKDEP
unsigned char nested_level;
#endif
@@ -5014,10 +5016,11 @@ void __hw_addr_init(struct netdev_hw_addr_list *list);
void __hw_addr_flush(struct netdev_hw_addr_list *list);
int __hw_addr_list_snapshot(struct netdev_hw_addr_list *snap,
const struct netdev_hw_addr_list *list,
- int addr_len);
+ int addr_len, struct netdev_hw_addr_list *cache);
void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list,
struct netdev_hw_addr_list *work,
- struct netdev_hw_addr_list *ref, int addr_len);
+ struct netdev_hw_addr_list *ref, int addr_len,
+ struct netdev_hw_addr_list *cache);
/* Functions used for device addresses handling */
void dev_addr_mod(struct net_device *dev, unsigned int offset,
diff --git a/net/core/dev.c b/net/core/dev.c
index ae1fd30d1ef9..3ddf347dcdd7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -12061,6 +12061,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
mutex_init(&dev->lock);
INIT_LIST_HEAD(&dev->rx_mode_node);
+ __hw_addr_init(&dev->rx_mode_addr_cache);
dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM;
setup(dev);
@@ -12167,6 +12168,7 @@ void free_netdev(struct net_device *dev)
if (netif_rx_mode_clean(dev))
dev_put(dev);
+ __hw_addr_flush(&dev->rx_mode_addr_cache);
/* Flush device addresses */
dev_addr_flush(dev);
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 61589537b2d3..2cff791ce374 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -511,30 +511,50 @@ void __hw_addr_init(struct netdev_hw_addr_list *list)
}
EXPORT_SYMBOL(__hw_addr_init);
+static void __hw_addr_splice(struct netdev_hw_addr_list *dst,
+ struct netdev_hw_addr_list *src)
+{
+ src->tree = RB_ROOT;
+ list_splice_init(&src->list, &dst->list);
+ dst->count += src->count;
+ src->count = 0;
+}
+
/**
* __hw_addr_list_snapshot - create a snapshot copy of an address list
* @snap: destination snapshot list (needs to be __hw_addr_init-initialized)
* @list: source address list to snapshot
* @addr_len: length of addresses
+ * @cache: entry cache to reuse entries from; falls back to GFP_ATOMIC
*
- * Creates a copy of @list with individually allocated entries suitable
- * for use with __hw_addr_sync_dev() and other list manipulation helpers.
- * Each entry is allocated with GFP_ATOMIC; must be called under a spinlock.
+ * Creates a copy of @list reusing entries from @cache when available.
+ * Must be called under a spinlock.
*
* Return: 0 on success, -errno on failure.
*/
int __hw_addr_list_snapshot(struct netdev_hw_addr_list *snap,
const struct netdev_hw_addr_list *list,
- int addr_len)
+ int addr_len, struct netdev_hw_addr_list *cache)
{
struct netdev_hw_addr *ha, *entry;
list_for_each_entry(ha, &list->list, list) {
- entry = __hw_addr_create(ha->addr, addr_len, ha->type,
- false, false);
- if (!entry) {
- __hw_addr_flush(snap);
- return -ENOMEM;
+ if (cache->count) {
+ entry = list_first_entry(&cache->list,
+ struct netdev_hw_addr, list);
+ list_del(&entry->list);
+ cache->count--;
+ memcpy(entry->addr, ha->addr, addr_len);
+ entry->type = ha->type;
+ entry->global_use = false;
+ entry->synced = 0;
+ } else {
+ entry = __hw_addr_create(ha->addr, addr_len, ha->type,
+ false, false);
+ if (!entry) {
+ __hw_addr_flush(snap);
+ return -ENOMEM;
+ }
}
entry->sync_cnt = ha->sync_cnt;
entry->refcount = ha->refcount;
@@ -554,15 +574,17 @@ EXPORT_SYMBOL_IF_KUNIT(__hw_addr_list_snapshot);
* @work: the working snapshot (modified by driver via __hw_addr_sync_dev)
* @ref: the reference snapshot (untouched copy of original state)
* @addr_len: length of addresses
+ * @cache: entry cache to return snapshot entries to for reuse
*
* Walks the reference snapshot and compares each entry against the work
* snapshot to compute sync_cnt deltas. Applies those deltas to @real_list.
- * Frees both snapshots when done.
+ * Returns snapshot entries to @cache for reuse; frees both snapshots.
* Caller must hold netif_addr_lock_bh.
*/
void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list,
struct netdev_hw_addr_list *work,
- struct netdev_hw_addr_list *ref, int addr_len)
+ struct netdev_hw_addr_list *ref, int addr_len,
+ struct netdev_hw_addr_list *cache)
{
struct netdev_hw_addr *ref_ha, *work_ha, *real_ha;
int delta;
@@ -614,8 +636,8 @@ void __hw_addr_list_reconcile(struct netdev_hw_addr_list *real_list,
}
}
- __hw_addr_flush(work);
- __hw_addr_flush(ref);
+ __hw_addr_splice(cache, work);
+ __hw_addr_splice(cache, ref);
}
EXPORT_SYMBOL_IF_KUNIT(__hw_addr_list_reconcile);
@@ -1176,14 +1198,18 @@ static int netif_addr_lists_snapshot(struct net_device *dev,
{
int err;
- err = __hw_addr_list_snapshot(uc_snap, &dev->uc, dev->addr_len);
+ err = __hw_addr_list_snapshot(uc_snap, &dev->uc, dev->addr_len,
+ &dev->rx_mode_addr_cache);
if (!err)
- err = __hw_addr_list_snapshot(uc_ref, &dev->uc, dev->addr_len);
+ err = __hw_addr_list_snapshot(uc_ref, &dev->uc, dev->addr_len,
+ &dev->rx_mode_addr_cache);
if (!err)
err = __hw_addr_list_snapshot(mc_snap, &dev->mc,
- dev->addr_len);
+ dev->addr_len,
+ &dev->rx_mode_addr_cache);
if (!err)
- err = __hw_addr_list_snapshot(mc_ref, &dev->mc, dev->addr_len);
+ err = __hw_addr_list_snapshot(mc_ref, &dev->mc, dev->addr_len,
+ &dev->rx_mode_addr_cache);
if (err) {
__hw_addr_flush(uc_snap);
@@ -1200,8 +1226,10 @@ static void netif_addr_lists_reconcile(struct net_device *dev,
struct netdev_hw_addr_list *uc_ref,
struct netdev_hw_addr_list *mc_ref)
{
- __hw_addr_list_reconcile(&dev->uc, uc_snap, uc_ref, dev->addr_len);
- __hw_addr_list_reconcile(&dev->mc, mc_snap, mc_ref, dev->addr_len);
+ __hw_addr_list_reconcile(&dev->uc, uc_snap, uc_ref, dev->addr_len,
+ &dev->rx_mode_addr_cache);
+ __hw_addr_list_reconcile(&dev->mc, mc_snap, mc_ref, dev->addr_len,
+ &dev->rx_mode_addr_cache);
}
/**
diff --git a/net/core/dev_addr_lists_test.c b/net/core/dev_addr_lists_test.c
index ba8578e7845a..0ccb7a7de894 100644
--- a/net/core/dev_addr_lists_test.c
+++ b/net/core/dev_addr_lists_test.c
@@ -251,8 +251,8 @@ static void dev_addr_test_add_excl(struct kunit *test)
*/
static void dev_addr_test_snapshot_sync(struct kunit *test)
{
+ struct netdev_hw_addr_list snap, ref, cache;
struct net_device *netdev = test->priv;
- struct netdev_hw_addr_list snap, ref;
struct dev_addr_test_priv *datp;
struct netdev_hw_addr *ha;
u8 addr[ETH_ALEN];
@@ -268,10 +268,13 @@ static void dev_addr_test_snapshot_sync(struct kunit *test)
netif_addr_lock_bh(netdev);
__hw_addr_init(&snap);
__hw_addr_init(&ref);
+ __hw_addr_init(&cache);
KUNIT_ASSERT_EQ(test, 0,
- __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
+ __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN,
+ &cache));
KUNIT_ASSERT_EQ(test, 0,
- __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
+ __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN,
+ &cache));
netif_addr_unlock_bh(netdev);
/* Driver syncs ADDR_A to hardware */
@@ -283,7 +286,8 @@ static void dev_addr_test_snapshot_sync(struct kunit *test)
/* Reconcile: delta=+1 applied to real entry */
netif_addr_lock_bh(netdev);
- __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN);
+ __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN,
+ &cache);
netif_addr_unlock_bh(netdev);
/* Real entry should now reflect the sync: sync_cnt=1, refcount=2 */
@@ -301,6 +305,7 @@ static void dev_addr_test_snapshot_sync(struct kunit *test)
KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced);
KUNIT_EXPECT_EQ(test, 1, netdev->uc.count);
+ __hw_addr_flush(&cache);
rtnl_unlock();
}
@@ -310,8 +315,8 @@ static void dev_addr_test_snapshot_sync(struct kunit *test)
*/
static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test)
{
+ struct netdev_hw_addr_list snap, ref, cache;
struct net_device *netdev = test->priv;
- struct netdev_hw_addr_list snap, ref;
struct dev_addr_test_priv *datp;
struct netdev_hw_addr *ha;
u8 addr[ETH_ALEN];
@@ -327,10 +332,13 @@ static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test)
netif_addr_lock_bh(netdev);
__hw_addr_init(&snap);
__hw_addr_init(&ref);
+ __hw_addr_init(&cache);
KUNIT_ASSERT_EQ(test, 0,
- __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
+ __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN,
+ &cache));
KUNIT_ASSERT_EQ(test, 0,
- __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
+ __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN,
+ &cache));
netif_addr_unlock_bh(netdev);
/* Driver syncs ADDR_A to hardware */
@@ -349,7 +357,8 @@ static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test)
* so it gets re-inserted as stale (sync_cnt=1, refcount=1).
*/
netif_addr_lock_bh(netdev);
- __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN);
+ __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN,
+ &cache);
netif_addr_unlock_bh(netdev);
KUNIT_EXPECT_EQ(test, 1, netdev->uc.count);
@@ -366,6 +375,7 @@ static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test)
KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_unsynced);
KUNIT_EXPECT_EQ(test, 0, netdev->uc.count);
+ __hw_addr_flush(&cache);
rtnl_unlock();
}
@@ -376,8 +386,8 @@ static void dev_addr_test_snapshot_remove_during_sync(struct kunit *test)
*/
static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test)
{
+ struct netdev_hw_addr_list snap, ref, cache;
struct net_device *netdev = test->priv;
- struct netdev_hw_addr_list snap, ref;
struct dev_addr_test_priv *datp;
struct netdev_hw_addr *ha;
u8 addr[ETH_ALEN];
@@ -403,10 +413,13 @@ static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test)
netif_addr_lock_bh(netdev);
__hw_addr_init(&snap);
__hw_addr_init(&ref);
+ __hw_addr_init(&cache);
KUNIT_ASSERT_EQ(test, 0,
- __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
+ __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN,
+ &cache));
KUNIT_ASSERT_EQ(test, 0,
- __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
+ __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN,
+ &cache));
netif_addr_unlock_bh(netdev);
/* Driver unsyncs stale ADDR_A from hardware */
@@ -426,7 +439,8 @@ static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test)
* applied. Result: sync_cnt=0, refcount=1 (fresh).
*/
netif_addr_lock_bh(netdev);
- __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN);
+ __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN,
+ &cache);
netif_addr_unlock_bh(netdev);
/* Entry survives as fresh: needs re-sync to HW */
@@ -443,6 +457,7 @@ static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test)
KUNIT_EXPECT_EQ(test, 1 << ADDR_A, datp->addr_synced);
KUNIT_EXPECT_EQ(test, 0, datp->addr_unsynced);
+ __hw_addr_flush(&cache);
rtnl_unlock();
}
@@ -452,8 +467,8 @@ static void dev_addr_test_snapshot_readd_during_unsync(struct kunit *test)
*/
static void dev_addr_test_snapshot_add_and_remove(struct kunit *test)
{
+ struct netdev_hw_addr_list snap, ref, cache;
struct net_device *netdev = test->priv;
- struct netdev_hw_addr_list snap, ref;
struct dev_addr_test_priv *datp;
struct netdev_hw_addr *ha;
u8 addr[ETH_ALEN];
@@ -480,10 +495,13 @@ static void dev_addr_test_snapshot_add_and_remove(struct kunit *test)
netif_addr_lock_bh(netdev);
__hw_addr_init(&snap);
__hw_addr_init(&ref);
+ __hw_addr_init(&cache);
KUNIT_ASSERT_EQ(test, 0,
- __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN));
+ __hw_addr_list_snapshot(&snap, &netdev->uc, ETH_ALEN,
+ &cache));
KUNIT_ASSERT_EQ(test, 0,
- __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN));
+ __hw_addr_list_snapshot(&ref, &netdev->uc, ETH_ALEN,
+ &cache));
netif_addr_unlock_bh(netdev);
/* Driver syncs snapshot: ADDR_C is new -> synced; A,B already synced */
@@ -502,7 +520,8 @@ static void dev_addr_test_snapshot_add_and_remove(struct kunit *test)
* so nothing to apply to ADDR_B.
*/
netif_addr_lock_bh(netdev);
- __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN);
+ __hw_addr_list_reconcile(&netdev->uc, &snap, &ref, ETH_ALEN,
+ &cache);
netif_addr_unlock_bh(netdev);
/* ADDR_A: unchanged (sync_cnt=1, refcount=2)
@@ -536,13 +555,14 @@ static void dev_addr_test_snapshot_add_and_remove(struct kunit *test)
KUNIT_EXPECT_EQ(test, 1 << ADDR_B, datp->addr_unsynced);
KUNIT_EXPECT_EQ(test, 2, netdev->uc.count);
+ __hw_addr_flush(&cache);
rtnl_unlock();
}
static void dev_addr_test_snapshot_benchmark(struct kunit *test)
{
struct net_device *netdev = test->priv;
- struct netdev_hw_addr_list snap;
+ struct netdev_hw_addr_list snap, cache;
s64 duration = 0;
ktime_t start;
u8 addr[ETH_ALEN];
@@ -557,6 +577,8 @@ static void dev_addr_test_snapshot_benchmark(struct kunit *test)
KUNIT_ASSERT_EQ(test, 0, dev_uc_add(netdev, addr));
}
+ __hw_addr_init(&cache);
+
for (iter = 0; iter < 1000; iter++) {
netif_addr_lock_bh(netdev);
__hw_addr_init(&snap);
@@ -564,13 +586,15 @@ static void dev_addr_test_snapshot_benchmark(struct kunit *test)
start = ktime_get();
KUNIT_ASSERT_EQ(test, 0,
__hw_addr_list_snapshot(&snap, &netdev->uc,
- ETH_ALEN));
+ ETH_ALEN, &cache));
duration += ktime_to_ns(ktime_sub(ktime_get(), start));
netif_addr_unlock_bh(netdev);
__hw_addr_flush(&snap);
}
+ __hw_addr_flush(&cache);
+
kunit_info(test,
"1024 addrs x 1000 snapshots: %lld ns total, %lld ns/iter",
duration, div_s64(duration, 1000));
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* [PATCH net-next v5 05/14] fbnic: convert to ndo_set_rx_mode_async
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (3 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 04/14] net: cache snapshot entries for ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-04 0:06 ` Jakub Kicinski
2026-04-02 22:55 ` [PATCH net-next v5 06/14] mlx5: convert to ndo_set_rx_mode_async Stanislav Fomichev
` (8 subsequent siblings)
13 siblings, 1 reply; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, Alexander Duyck, kernel-team,
Aleksandr Loktionov
Convert fbnic from ndo_set_rx_mode to ndo_set_rx_mode_async. The
driver's __fbnic_set_rx_mode() now takes explicit uc/mc list
parameters and uses __hw_addr_sync_dev() on the snapshots instead
of __dev_uc_sync/__dev_mc_sync on the netdev directly.
Update callers in fbnic_up, fbnic_fw_config_after_crash,
fbnic_bmc_rpc_check and fbnic_set_mac to pass the real address
lists, as they call __fbnic_set_rx_mode outside the async work path.
Cc: Alexander Duyck <alexanderduyck@fb.com>
Cc: kernel-team@meta.com
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
.../net/ethernet/meta/fbnic/fbnic_netdev.c | 20 ++++++++++++-------
.../net/ethernet/meta/fbnic/fbnic_netdev.h | 4 +++-
drivers/net/ethernet/meta/fbnic/fbnic_pci.c | 4 ++--
drivers/net/ethernet/meta/fbnic/fbnic_rpc.c | 2 +-
4 files changed, 19 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
index b4b396ca9bce..c406a3b56b37 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
@@ -183,7 +183,9 @@ static int fbnic_mc_unsync(struct net_device *netdev, const unsigned char *addr)
return ret;
}
-void __fbnic_set_rx_mode(struct fbnic_dev *fbd)
+void __fbnic_set_rx_mode(struct fbnic_dev *fbd,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
bool uc_promisc = false, mc_promisc = false;
struct net_device *netdev = fbd->netdev;
@@ -213,10 +215,10 @@ void __fbnic_set_rx_mode(struct fbnic_dev *fbd)
}
/* Synchronize unicast and multicast address lists */
- err = __dev_uc_sync(netdev, fbnic_uc_sync, fbnic_uc_unsync);
+ err = __hw_addr_sync_dev(uc, netdev, fbnic_uc_sync, fbnic_uc_unsync);
if (err == -ENOSPC)
uc_promisc = true;
- err = __dev_mc_sync(netdev, fbnic_mc_sync, fbnic_mc_unsync);
+ err = __hw_addr_sync_dev(mc, netdev, fbnic_mc_sync, fbnic_mc_unsync);
if (err == -ENOSPC)
mc_promisc = true;
@@ -238,18 +240,21 @@ void __fbnic_set_rx_mode(struct fbnic_dev *fbd)
fbnic_write_tce_tcam(fbd);
}
-static void fbnic_set_rx_mode(struct net_device *netdev)
+static void fbnic_set_rx_mode(struct net_device *netdev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
struct fbnic_net *fbn = netdev_priv(netdev);
struct fbnic_dev *fbd = fbn->fbd;
/* No need to update the hardware if we are not running */
if (netif_running(netdev))
- __fbnic_set_rx_mode(fbd);
+ __fbnic_set_rx_mode(fbd, uc, mc);
}
static int fbnic_set_mac(struct net_device *netdev, void *p)
{
+ struct fbnic_net *fbn = netdev_priv(netdev);
struct sockaddr *addr = p;
if (!is_valid_ether_addr(addr->sa_data))
@@ -257,7 +262,8 @@ static int fbnic_set_mac(struct net_device *netdev, void *p)
eth_hw_addr_set(netdev, addr->sa_data);
- fbnic_set_rx_mode(netdev);
+ if (netif_running(netdev))
+ __fbnic_set_rx_mode(fbn->fbd, &netdev->uc, &netdev->mc);
return 0;
}
@@ -551,7 +557,7 @@ static const struct net_device_ops fbnic_netdev_ops = {
.ndo_features_check = fbnic_features_check,
.ndo_set_mac_address = fbnic_set_mac,
.ndo_change_mtu = fbnic_change_mtu,
- .ndo_set_rx_mode = fbnic_set_rx_mode,
+ .ndo_set_rx_mode_async = fbnic_set_rx_mode,
.ndo_get_stats64 = fbnic_get_stats64,
.ndo_bpf = fbnic_bpf,
.ndo_hwtstamp_get = fbnic_hwtstamp_get,
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.h b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.h
index 9129a658f8fa..eded20b0e9e4 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.h
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.h
@@ -97,7 +97,9 @@ void fbnic_time_init(struct fbnic_net *fbn);
int fbnic_time_start(struct fbnic_net *fbn);
void fbnic_time_stop(struct fbnic_net *fbn);
-void __fbnic_set_rx_mode(struct fbnic_dev *fbd);
+void __fbnic_set_rx_mode(struct fbnic_dev *fbd,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc);
void fbnic_clear_rx_mode(struct fbnic_dev *fbd);
void fbnic_phylink_get_pauseparam(struct net_device *netdev,
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_pci.c b/drivers/net/ethernet/meta/fbnic/fbnic_pci.c
index e3aebbe3656d..6b139cf54256 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_pci.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_pci.c
@@ -135,7 +135,7 @@ void fbnic_up(struct fbnic_net *fbn)
fbnic_rss_reinit_hw(fbn->fbd, fbn);
- __fbnic_set_rx_mode(fbn->fbd);
+ __fbnic_set_rx_mode(fbn->fbd, &fbn->netdev->uc, &fbn->netdev->mc);
/* Enable Tx/Rx processing */
fbnic_napi_enable(fbn);
@@ -180,7 +180,7 @@ static int fbnic_fw_config_after_crash(struct fbnic_dev *fbd)
}
fbnic_rpc_reset_valid_entries(fbd);
- __fbnic_set_rx_mode(fbd);
+ __fbnic_set_rx_mode(fbd, &fbd->netdev->uc, &fbd->netdev->mc);
return 0;
}
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c b/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c
index 42a186db43ea..fe95b6f69646 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c
@@ -244,7 +244,7 @@ void fbnic_bmc_rpc_check(struct fbnic_dev *fbd)
if (fbd->fw_cap.need_bmc_tcam_reinit) {
fbnic_bmc_rpc_init(fbd);
- __fbnic_set_rx_mode(fbd);
+ __fbnic_set_rx_mode(fbd, &fbd->netdev->uc, &fbd->netdev->mc);
fbd->fw_cap.need_bmc_tcam_reinit = false;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 05/14] fbnic: convert to ndo_set_rx_mode_async
2026-04-02 22:55 ` [PATCH net-next v5 05/14] fbnic: convert to ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-04 0:06 ` Jakub Kicinski
2026-04-06 22:29 ` [PATCH net-next v5 05/14] fbnic: convert to ndo_set_rx_mode_async (1d5e76c60ed0) sdf.kernel
0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2026-04-04 0:06 UTC (permalink / raw)
To: sdf; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
fbnic: convert to ndo_set_rx_mode_async
Convert fbnic from ndo_set_rx_mode to ndo_set_rx_mode_async, updating
the driver to take explicit uc/mc lists and using __hw_addr_sync_dev()
on the snapshots instead of __dev_uc_sync() on the netdev directly.
> diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
> index b4b396ca9bce3..c406a3b56b37f 100644
> --- a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
> +++ b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
[ ... ]
> @@ -213,10 +215,10 @@ void __fbnic_set_rx_mode(struct fbnic_dev *fbd)
> }
>
> /* Synchronize unicast and multicast address lists */
> - err = __dev_uc_sync(netdev, fbnic_uc_sync, fbnic_uc_unsync);
> + err = __hw_addr_sync_dev(uc, netdev, fbnic_uc_sync, fbnic_uc_unsync);
> if (err == -ENOSPC)
> uc_promisc = true;
> - err = __dev_mc_sync(netdev, fbnic_mc_sync, fbnic_mc_unsync);
> + err = __hw_addr_sync_dev(mc, netdev, fbnic_mc_sync, fbnic_mc_unsync);
Since __fbnic_set_rx_mode() performs read-modify-write operations on the
internal fbd->mac_addr array and updates the device TCAM, is there a need
for internal synchronization?
It appears that concurrent out-of-band callers like fbnic_bmc_rpc_check()
(from an unlocked workqueue), RTNL paths like fbnic_set_mac(), and the async
RX mode workqueue can race on the internal fbd->mac_addr array and interleave
writes to the hardware, potentially breaking the RX filtering rules.
> if (err == -ENOSPC)
> mc_promisc = true;
[ ... ]
> @@ -257,7 +262,8 @@ static int fbnic_set_mac(struct net_device *netdev, void *p)
>
> eth_hw_addr_set(netdev, addr->sa_data);
>
> - fbnic_set_rx_mode(netdev);
> + if (netif_running(netdev))
> + __fbnic_set_rx_mode(fbn->fbd, &netdev->uc, &netdev->mc);
Is it safe to pass the live &netdev->uc and &netdev->mc address lists to
__hw_addr_sync_dev() without holding netif_addr_lock_bh() or rtnl_lock()?
This bypasses the core network stack's lock, which could lead to list
corruption and use-after-free if the network stack concurrently modifies
the address lists (e.g., via dev_mc_add()).
Additionally, doesn't manually mutating sync_cnt on the live lists bypass
the core's snapshot reconciliation state machine?
>
> return 0;
> }
> diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_pci.c b/drivers/net/ethernet/meta/fbnic/fbnic_pci.c
> index e3aebbe3656d5..6b139cf542567 100644
> --- a/drivers/net/ethernet/meta/fbnic/fbnic_pci.c
> +++ b/drivers/net/ethernet/meta/fbnic/fbnic_pci.c
[ ... ]
> @@ -180,7 +180,7 @@ static int fbnic_fw_config_after_crash(struct fbnic_dev *fbd)
> }
>
> fbnic_rpc_reset_valid_entries(fbd);
> - __fbnic_set_rx_mode(fbd);
> + __fbnic_set_rx_mode(fbd, &fbd->netdev->uc, &fbd->netdev->mc);
Similarly here, fbnic_fw_config_after_crash() appears to be executed from the
unlocked fbnic_service_task workqueue. Could this cause lockless data races on
the live network stack lists?
>
> return 0;
> }
> diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c b/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c
> index 42a186db43ea9..fe95b6f696463 100644
> --- a/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c
> +++ b/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c
[ ... ]
> @@ -244,7 +244,7 @@ void fbnic_bmc_rpc_check(struct fbnic_dev *fbd)
>
> if (fbd->fw_cap.need_bmc_tcam_reinit) {
> fbnic_bmc_rpc_init(fbd);
> - __fbnic_set_rx_mode(fbd);
> + __fbnic_set_rx_mode(fbd, &fbd->netdev->uc, &fbd->netdev->mc);
And here as well, passing live address lists without synchronization.
^ permalink raw reply [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 05/14] fbnic: convert to ndo_set_rx_mode_async (1d5e76c60ed0)
2026-04-04 0:06 ` Jakub Kicinski
@ 2026-04-06 22:29 ` sdf.kernel
0 siblings, 0 replies; 33+ messages in thread
From: sdf.kernel @ 2026-04-06 22:29 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> fbnic: convert to ndo_set_rx_mode_async
>
> Convert fbnic from ndo_set_rx_mode to ndo_set_rx_mode_async, updating
> the driver to take explicit uc/mc lists and using __hw_addr_sync_dev()
> on the snapshots instead of __dev_uc_sync() on the netdev directly.
>
> > diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
> > index b4b396ca9bce3..c406a3b56b37f 100644
> > --- a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
> > +++ b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
> [ ... ]
> > @@ -213,10 +215,10 @@ void __fbnic_set_rx_mode(struct fbnic_dev *fbd)
> > }
> >
> > /* Synchronize unicast and multicast address lists */
> > - err = __dev_uc_sync(netdev, fbnic_uc_sync, fbnic_uc_unsync);
> > + err = __hw_addr_sync_dev(uc, netdev, fbnic_uc_sync, fbnic_uc_unsync);
> > if (err == -ENOSPC)
> > uc_promisc = true;
> > - err = __dev_mc_sync(netdev, fbnic_mc_sync, fbnic_mc_unsync);
> > + err = __hw_addr_sync_dev(mc, netdev, fbnic_mc_sync, fbnic_mc_unsync);
>
> Since __fbnic_set_rx_mode() performs read-modify-write operations on the
> internal fbd->mac_addr array and updates the device TCAM, is there a need
> for internal synchronization?
>
> It appears that concurrent out-of-band callers like fbnic_bmc_rpc_check()
> (from an unlocked workqueue), RTNL paths like fbnic_set_mac(), and the async
> RX mode workqueue can race on the internal fbd->mac_addr array and interleave
> writes to the hardware, potentially breaking the RX filtering rules.
This is bogus, both paths are rtnl/ops protected.
>
> > if (err == -ENOSPC)
> > mc_promisc = true;
> [ ... ]
> > @@ -257,7 +262,8 @@ static int fbnic_set_mac(struct net_device *netdev, void *p)
> >
> > eth_hw_addr_set(netdev, addr->sa_data);
> >
> > - fbnic_set_rx_mode(netdev);
> > + if (netif_running(netdev))
> > + __fbnic_set_rx_mode(fbn->fbd, &netdev->uc, &netdev->mc);
>
> Is it safe to pass the live &netdev->uc and &netdev->mc address lists to
> __hw_addr_sync_dev() without holding netif_addr_lock_bh() or rtnl_lock()?
>
> This bypasses the core network stack's lock, which could lead to list
> corruption and use-after-free if the network stack concurrently modifies
> the address lists (e.g., via dev_mc_add()).
>
> Additionally, doesn't manually mutating sync_cnt on the live lists bypass
> the core's snapshot reconciliation state machine?
This is the same as before, nothing new added here (and, again, under
rtnl/ops lock).
>
> >
> > return 0;
> > }
> > diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_pci.c b/drivers/net/ethernet/meta/fbnic/fbnic_pci.c
> > index e3aebbe3656d5..6b139cf542567 100644
> > --- a/drivers/net/ethernet/meta/fbnic/fbnic_pci.c
> > +++ b/drivers/net/ethernet/meta/fbnic/fbnic_pci.c
> [ ... ]
> > @@ -180,7 +180,7 @@ static int fbnic_fw_config_after_crash(struct fbnic_dev *fbd)
> > }
> >
> > fbnic_rpc_reset_valid_entries(fbd);
> > - __fbnic_set_rx_mode(fbd);
> > + __fbnic_set_rx_mode(fbd, &fbd->netdev->uc, &fbd->netdev->mc);
>
> Similarly here, fbnic_fw_config_after_crash() appears to be executed from the
> unlocked fbnic_service_task workqueue. Could this cause lockless data races on
> the live network stack lists?
Nothing is lockless here, everything is under rtnl at least.
>
> >
> > return 0;
> > }
> > diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c b/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c
> > index 42a186db43ea9..fe95b6f696463 100644
> > --- a/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c
> > +++ b/drivers/net/ethernet/meta/fbnic/fbnic_rpc.c
> [ ... ]
> > @@ -244,7 +244,7 @@ void fbnic_bmc_rpc_check(struct fbnic_dev *fbd)
> >
> > if (fbd->fw_cap.need_bmc_tcam_reinit) {
> > fbnic_bmc_rpc_init(fbd);
> > - __fbnic_set_rx_mode(fbd);
> > + __fbnic_set_rx_mode(fbd, &fbd->netdev->uc, &fbd->netdev->mc);
>
> And here as well, passing live address lists without synchronization.
Same here, nothing technically changed and this runs under rtnl.
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH net-next v5 06/14] mlx5: convert to ndo_set_rx_mode_async
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (4 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 05/14] fbnic: convert to ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-04 0:06 ` Jakub Kicinski
2026-04-02 22:55 ` [PATCH net-next v5 07/14] bnxt: convert to ndo_set_rx_mode_async Stanislav Fomichev
` (7 subsequent siblings)
13 siblings, 1 reply; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, Saeed Mahameed, Tariq Toukan,
Cosmin Ratiu, Aleksandr Loktionov
Convert mlx5 from ndo_set_rx_mode to ndo_set_rx_mode_async. The
driver's mlx5e_set_rx_mode now receives uc/mc snapshots and calls
mlx5e_fs_set_rx_mode_work directly instead of queueing work.
mlx5e_sync_netdev_addr and mlx5e_handle_netdev_addr now take
explicit uc/mc list parameters and iterate with
netdev_hw_addr_list_for_each instead of netdev_for_each_{uc,mc}_addr.
Fall back to netdev's uc/mc in a few places and grab the addr lock.
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
.../net/ethernet/mellanox/mlx5/core/en/fs.h | 5 ++-
.../net/ethernet/mellanox/mlx5/core/en_fs.c | 32 ++++++++++++-------
.../net/ethernet/mellanox/mlx5/core/en_main.c | 13 +++++---
3 files changed, 34 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
index c3408b3f7010..091b80a67189 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
@@ -201,7 +201,10 @@ int mlx5e_add_vlan_trap(struct mlx5e_flow_steering *fs, int trap_id, int tir_nu
void mlx5e_remove_vlan_trap(struct mlx5e_flow_steering *fs);
int mlx5e_add_mac_trap(struct mlx5e_flow_steering *fs, int trap_id, int tir_num);
void mlx5e_remove_mac_trap(struct mlx5e_flow_steering *fs);
-void mlx5e_fs_set_rx_mode_work(struct mlx5e_flow_steering *fs, struct net_device *netdev);
+void mlx5e_fs_set_rx_mode_work(struct mlx5e_flow_steering *fs,
+ struct net_device *netdev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc);
int mlx5e_fs_vlan_rx_add_vid(struct mlx5e_flow_steering *fs,
struct net_device *netdev,
__be16 proto, u16 vid);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index fdfe9d1cfe21..12492c4a5d41 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -609,20 +609,26 @@ static void mlx5e_execute_l2_action(struct mlx5e_flow_steering *fs,
}
static void mlx5e_sync_netdev_addr(struct mlx5e_flow_steering *fs,
- struct net_device *netdev)
+ struct net_device *netdev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
struct netdev_hw_addr *ha;
- netif_addr_lock_bh(netdev);
+ if (!uc || !mc) {
+ netif_addr_lock_bh(netdev);
+ mlx5e_sync_netdev_addr(fs, netdev, &netdev->uc, &netdev->mc);
+ netif_addr_unlock_bh(netdev);
+ return;
+ }
mlx5e_add_l2_to_hash(fs->l2.netdev_uc, netdev->dev_addr);
- netdev_for_each_uc_addr(ha, netdev)
+
+ netdev_hw_addr_list_for_each(ha, uc)
mlx5e_add_l2_to_hash(fs->l2.netdev_uc, ha->addr);
- netdev_for_each_mc_addr(ha, netdev)
+ netdev_hw_addr_list_for_each(ha, mc)
mlx5e_add_l2_to_hash(fs->l2.netdev_mc, ha->addr);
-
- netif_addr_unlock_bh(netdev);
}
static void mlx5e_fill_addr_array(struct mlx5e_flow_steering *fs, int list_type,
@@ -724,7 +730,9 @@ static void mlx5e_apply_netdev_addr(struct mlx5e_flow_steering *fs)
}
static void mlx5e_handle_netdev_addr(struct mlx5e_flow_steering *fs,
- struct net_device *netdev)
+ struct net_device *netdev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
struct mlx5e_l2_hash_node *hn;
struct hlist_node *tmp;
@@ -736,7 +744,7 @@ static void mlx5e_handle_netdev_addr(struct mlx5e_flow_steering *fs,
hn->action = MLX5E_ACTION_DEL;
if (fs->state_destroy)
- mlx5e_sync_netdev_addr(fs, netdev);
+ mlx5e_sync_netdev_addr(fs, netdev, uc, mc);
mlx5e_apply_netdev_addr(fs);
}
@@ -820,13 +828,15 @@ static void mlx5e_destroy_promisc_table(struct mlx5e_flow_steering *fs)
}
void mlx5e_fs_set_rx_mode_work(struct mlx5e_flow_steering *fs,
- struct net_device *netdev)
+ struct net_device *netdev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
struct mlx5e_priv *priv = netdev_priv(netdev);
struct mlx5e_l2_table *ea = &fs->l2;
if (mlx5e_is_uplink_rep(priv)) {
- mlx5e_handle_netdev_addr(fs, netdev);
+ mlx5e_handle_netdev_addr(fs, netdev, uc, mc);
goto update_vport_context;
}
@@ -856,7 +866,7 @@ void mlx5e_fs_set_rx_mode_work(struct mlx5e_flow_steering *fs,
if (enable_broadcast)
mlx5e_add_l2_flow_rule(fs, &ea->broadcast, MLX5E_FULLMATCH);
- mlx5e_handle_netdev_addr(fs, netdev);
+ mlx5e_handle_netdev_addr(fs, netdev, uc, mc);
if (disable_broadcast)
mlx5e_del_l2_flow_rule(fs, &ea->broadcast);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1238e5356012..45f5bacef255 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4105,11 +4105,13 @@ static void mlx5e_nic_set_rx_mode(struct mlx5e_priv *priv)
queue_work(priv->wq, &priv->set_rx_mode_work);
}
-static void mlx5e_set_rx_mode(struct net_device *dev)
+static void mlx5e_set_rx_mode(struct net_device *dev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
struct mlx5e_priv *priv = netdev_priv(dev);
- mlx5e_nic_set_rx_mode(priv);
+ mlx5e_fs_set_rx_mode_work(priv->fs, dev, uc, mc);
}
static int mlx5e_set_mac(struct net_device *netdev, void *addr)
@@ -5284,7 +5286,7 @@ const struct net_device_ops mlx5e_netdev_ops = {
.ndo_setup_tc = mlx5e_setup_tc,
.ndo_select_queue = mlx5e_select_queue,
.ndo_get_stats64 = mlx5e_get_stats,
- .ndo_set_rx_mode = mlx5e_set_rx_mode,
+ .ndo_set_rx_mode_async = mlx5e_set_rx_mode,
.ndo_set_mac_address = mlx5e_set_mac,
.ndo_vlan_rx_add_vid = mlx5e_vlan_rx_add_vid,
.ndo_vlan_rx_kill_vid = mlx5e_vlan_rx_kill_vid,
@@ -6269,8 +6271,11 @@ void mlx5e_set_rx_mode_work(struct work_struct *work)
{
struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv,
set_rx_mode_work);
+ struct net_device *dev = priv->netdev;
- return mlx5e_fs_set_rx_mode_work(priv->fs, priv->netdev);
+ netdev_lock_ops(dev);
+ mlx5e_fs_set_rx_mode_work(priv->fs, dev, NULL, NULL);
+ netdev_unlock_ops(dev);
}
/* mlx5e generic netdev management API (move to en_common.c) */
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH net-next v5 06/14] mlx5: convert to ndo_set_rx_mode_async
2026-04-02 22:55 ` [PATCH net-next v5 06/14] mlx5: convert to ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-04 0:06 ` Jakub Kicinski
2026-04-06 22:29 ` [PATCH net-next v5 06/14] mlx5: convert to ndo_set_rx_mode_async (3691f90f6593) sdf.kernel
0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2026-04-04 0:06 UTC (permalink / raw)
To: sdf; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
mlx5: convert to ndo_set_rx_mode_async
This commit converts the mlx5 driver to use ndo_set_rx_mode_async. It passes
the uc and mc snapshots directly to the internal work handlers instead of
queueing work and manually iterating the address lists.
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
> @@ -609,20 +609,26 @@ static void mlx5e_execute_l2_action(struct mlx5e_flow_steering *fs,
> }
>
> static void mlx5e_sync_netdev_addr(struct mlx5e_flow_steering *fs,
> - struct net_device *netdev)
> + struct net_device *netdev,
> + struct netdev_hw_addr_list *uc,
> + struct netdev_hw_addr_list *mc)
> {
> struct netdev_hw_addr *ha;
>
> - netif_addr_lock_bh(netdev);
> + if (!uc || !mc) {
> + netif_addr_lock_bh(netdev);
> + mlx5e_sync_netdev_addr(fs, netdev, &netdev->uc, &netdev->mc);
> + netif_addr_unlock_bh(netdev);
> + return;
> + }
>
> mlx5e_add_l2_to_hash(fs->l2.netdev_uc, netdev->dev_addr);
Can this lead to a torn read of the MAC address?
When uc and mc are provided, this path now reads netdev->dev_addr without
holding netif_addr_lock_bh().
Because ndo_set_rx_mode_async runs asynchronously in a workqueue, this read
could happen concurrently with a MAC address change in mlx5e_set_mac().
Since mlx5e_set_mac() updates the address using eth_hw_addr_set() under
netif_addr_lock_bh(), a concurrent read here might be torn and program a
corrupted MAC address into the hardware filters.
[ ... ]
^ permalink raw reply [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 06/14] mlx5: convert to ndo_set_rx_mode_async (3691f90f6593)
2026-04-04 0:06 ` Jakub Kicinski
@ 2026-04-06 22:29 ` sdf.kernel
0 siblings, 0 replies; 33+ messages in thread
From: sdf.kernel @ 2026-04-06 22:29 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> mlx5: convert to ndo_set_rx_mode_async
>
> This commit converts the mlx5 driver to use ndo_set_rx_mode_async. It passes
> the uc and mc snapshots directly to the internal work handlers instead of
> queueing work and manually iterating the address lists.
>
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
> > @@ -609,20 +609,26 @@ static void mlx5e_execute_l2_action(struct mlx5e_flow_steering *fs,
> > }
> >
> > static void mlx5e_sync_netdev_addr(struct mlx5e_flow_steering *fs,
> > - struct net_device *netdev)
> > + struct net_device *netdev,
> > + struct netdev_hw_addr_list *uc,
> > + struct netdev_hw_addr_list *mc)
> > {
> > struct netdev_hw_addr *ha;
> >
> > - netif_addr_lock_bh(netdev);
> > + if (!uc || !mc) {
> > + netif_addr_lock_bh(netdev);
> > + mlx5e_sync_netdev_addr(fs, netdev, &netdev->uc, &netdev->mc);
> > + netif_addr_unlock_bh(netdev);
> > + return;
> > + }
> >
> > mlx5e_add_l2_to_hash(fs->l2.netdev_uc, netdev->dev_addr);
>
> Can this lead to a torn read of the MAC address?
>
> When uc and mc are provided, this path now reads netdev->dev_addr without
> holding netif_addr_lock_bh().
>
> Because ndo_set_rx_mode_async runs asynchronously in a workqueue, this read
> could happen concurrently with a MAC address change in mlx5e_set_mac().
> Since mlx5e_set_mac() updates the address using eth_hw_addr_set() under
> netif_addr_lock_bh(), a concurrent read here might be torn and program a
> corrupted MAC address into the hardware filters.
All existing netif_set_mac_address paths (and, hence, mlx5e_set_mac) run under
rtnl/ops lock.
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH net-next v5 07/14] bnxt: convert to ndo_set_rx_mode_async
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (5 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 06/14] mlx5: convert to ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-04 0:06 ` Jakub Kicinski
2026-04-02 22:55 ` [PATCH net-next v5 08/14] bnxt: use snapshot in bnxt_cfg_rx_mode Stanislav Fomichev
` (6 subsequent siblings)
13 siblings, 1 reply; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, Michael Chan, Pavan Chebbi,
Aleksandr Loktionov
Convert bnxt from ndo_set_rx_mode to ndo_set_rx_mode_async.
bnxt_set_rx_mode, bnxt_mc_list_updated and bnxt_uc_list_updated
now take explicit uc/mc list parameters and iterate with
netdev_hw_addr_list_for_each instead of netdev_for_each_{uc,mc}_addr.
The bnxt_cfg_rx_mode internal caller passes the real lists under
netif_addr_lock_bh.
BNXT_RX_MASK_SP_EVENT is still used here, next patch converts to
the direct call.
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 31 +++++++++++++----------
1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 8ec611bc01ee..11956c7dde23 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -11033,7 +11033,8 @@ static int bnxt_setup_nitroa0_vnic(struct bnxt *bp)
}
static int bnxt_cfg_rx_mode(struct bnxt *);
-static bool bnxt_mc_list_updated(struct bnxt *, u32 *);
+static bool bnxt_mc_list_updated(struct bnxt *, u32 *,
+ const struct netdev_hw_addr_list *);
static int bnxt_init_chip(struct bnxt *bp, bool irq_re_init)
{
@@ -11123,7 +11124,7 @@ static int bnxt_init_chip(struct bnxt *bp, bool irq_re_init)
} else if (bp->dev->flags & IFF_MULTICAST) {
u32 mask = 0;
- bnxt_mc_list_updated(bp, &mask);
+ bnxt_mc_list_updated(bp, &mask, &bp->dev->mc);
vnic->rx_mask |= mask;
}
@@ -13512,17 +13513,17 @@ void bnxt_get_ring_drv_stats(struct bnxt *bp,
bnxt_get_one_ring_drv_stats(bp, stats, &bp->bnapi[i]->cp_ring);
}
-static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask)
+static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask,
+ const struct netdev_hw_addr_list *mc)
{
struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT];
- struct net_device *dev = bp->dev;
struct netdev_hw_addr *ha;
u8 *haddr;
int mc_count = 0;
bool update = false;
int off = 0;
- netdev_for_each_mc_addr(ha, dev) {
+ netdev_hw_addr_list_for_each(ha, mc) {
if (mc_count >= BNXT_MAX_MC_ADDRS) {
*rx_mask |= CFA_L2_SET_RX_MASK_REQ_MASK_ALL_MCAST;
vnic->mc_list_count = 0;
@@ -13546,17 +13547,17 @@ static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask)
return update;
}
-static bool bnxt_uc_list_updated(struct bnxt *bp)
+static bool bnxt_uc_list_updated(struct bnxt *bp,
+ const struct netdev_hw_addr_list *uc)
{
- struct net_device *dev = bp->dev;
struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT];
struct netdev_hw_addr *ha;
int off = 0;
- if (netdev_uc_count(dev) != (vnic->uc_filter_count - 1))
+ if (netdev_hw_addr_list_count(uc) != (vnic->uc_filter_count - 1))
return true;
- netdev_for_each_uc_addr(ha, dev) {
+ netdev_hw_addr_list_for_each(ha, uc) {
if (!ether_addr_equal(ha->addr, vnic->uc_list + off))
return true;
@@ -13565,7 +13566,9 @@ static bool bnxt_uc_list_updated(struct bnxt *bp)
return false;
}
-static void bnxt_set_rx_mode(struct net_device *dev)
+static void bnxt_set_rx_mode(struct net_device *dev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
struct bnxt *bp = netdev_priv(dev);
struct bnxt_vnic_info *vnic;
@@ -13586,7 +13589,7 @@ static void bnxt_set_rx_mode(struct net_device *dev)
if (dev->flags & IFF_PROMISC)
mask |= CFA_L2_SET_RX_MASK_REQ_MASK_PROMISCUOUS;
- uc_update = bnxt_uc_list_updated(bp);
+ uc_update = bnxt_uc_list_updated(bp, uc);
if (dev->flags & IFF_BROADCAST)
mask |= CFA_L2_SET_RX_MASK_REQ_MASK_BCAST;
@@ -13594,7 +13597,7 @@ static void bnxt_set_rx_mode(struct net_device *dev)
mask |= CFA_L2_SET_RX_MASK_REQ_MASK_ALL_MCAST;
vnic->mc_list_count = 0;
} else if (dev->flags & IFF_MULTICAST) {
- mc_update = bnxt_mc_list_updated(bp, &mask);
+ mc_update = bnxt_mc_list_updated(bp, &mask, mc);
}
if (mask != vnic->rx_mask || uc_update || mc_update) {
@@ -13613,7 +13616,7 @@ static int bnxt_cfg_rx_mode(struct bnxt *bp)
bool uc_update;
netif_addr_lock_bh(dev);
- uc_update = bnxt_uc_list_updated(bp);
+ uc_update = bnxt_uc_list_updated(bp, &dev->uc);
netif_addr_unlock_bh(dev);
if (!uc_update)
@@ -15864,7 +15867,7 @@ static const struct net_device_ops bnxt_netdev_ops = {
.ndo_start_xmit = bnxt_start_xmit,
.ndo_stop = bnxt_close,
.ndo_get_stats64 = bnxt_get_stats64,
- .ndo_set_rx_mode = bnxt_set_rx_mode,
+ .ndo_set_rx_mode_async = bnxt_set_rx_mode,
.ndo_eth_ioctl = bnxt_ioctl,
.ndo_validate_addr = eth_validate_addr,
.ndo_set_mac_address = bnxt_change_mac_addr,
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH net-next v5 07/14] bnxt: convert to ndo_set_rx_mode_async
2026-04-02 22:55 ` [PATCH net-next v5 07/14] bnxt: convert to ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-04 0:06 ` Jakub Kicinski
2026-04-06 22:29 ` [PATCH net-next v5 07/14] bnxt: convert to ndo_set_rx_mode_async (c1776bbe53ec) sdf.kernel
0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2026-04-04 0:06 UTC (permalink / raw)
To: sdf; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
bnxt: convert to ndo_set_rx_mode_async
This commit converts the bnxt driver to use ndo_set_rx_mode_async, updating
the rx mode callbacks to take explicit list parameters and iterate over them.
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index fe8b886ff82ea..569c7de7c5e44 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
[ ... ]
> @@ -13523,17 +13524,17 @@ void bnxt_get_ring_drv_stats(struct bnxt *bp,
> bnxt_get_one_ring_drv_stats(bp, stats, &bp->bnapi[i]->cp_ring);
> }
>
> -static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask)
> +static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask,
> + const struct netdev_hw_addr_list *mc)
> {
> struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT];
> - struct net_device *dev = bp->dev;
> struct netdev_hw_addr *ha;
> u8 *haddr;
> int mc_count = 0;
> bool update = false;
> int off = 0;
>
> - netdev_for_each_mc_addr(ha, dev) {
> + netdev_hw_addr_list_for_each(ha, mc) {
When iterating over the list, bnxt_mc_list_updated() performs a memcpy()
into vnic->mc_list.
If this executes asynchronously while bnxt_cfg_rx_mode() is mapping
vnic->mc_list for DMA to the firmware via bnxt_hwrm_cfa_l2_set_rx_mask(),
could the NIC hardware read torn or corrupted multicast MAC addresses?
Even if a later patch addresses the explicit list TOCTOU issue, is a
driver-level spinlock needed here to serialize vnic state access between
the two concurrent workqueues?
[ ... ]
> @@ -15881,7 +15884,7 @@ static const struct net_device_ops bnxt_netdev_ops = {
> .ndo_start_xmit = bnxt_start_xmit,
> .ndo_stop = bnxt_close,
> .ndo_get_stats64 = bnxt_get_stats64,
> - .ndo_set_rx_mode = bnxt_set_rx_mode,
> + .ndo_set_rx_mode_async = bnxt_set_rx_mode,
Since this callback now runs asynchronously without the netif_addr_lock_bh()
that the core networking stack previously held, does this introduce data races
on the driver state?
With bnxt_set_rx_mode() executing locklessly in the netdev_rx_mode_work
workqueue, and bnxt_cfg_rx_mode() concurrently modifying vnic->uc_list,
vnic->uc_filter_count, and vnic->rx_mask in the driver's bnxt_sp_task
workqueue, can they race when accessing the vnic state?
For example, could lockless assignments to vnic->rx_mask in bnxt_set_rx_mode()
temporarily drop hardware fallback states (like the PROMISCUOUS bit) set by
bnxt_cfg_rx_mode()?
> .ndo_eth_ioctl = bnxt_ioctl,
> .ndo_validate_addr = eth_validate_addr,
> .ndo_set_mac_address = bnxt_change_mac_addr,
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH net-next v5 07/14] bnxt: convert to ndo_set_rx_mode_async (c1776bbe53ec)
2026-04-04 0:06 ` Jakub Kicinski
@ 2026-04-06 22:29 ` sdf.kernel
0 siblings, 0 replies; 33+ messages in thread
From: sdf.kernel @ 2026-04-06 22:29 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> bnxt: convert to ndo_set_rx_mode_async
>
> This commit converts the bnxt driver to use ndo_set_rx_mode_async, updating
> the rx mode callbacks to take explicit list parameters and iterate over them.
>
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index fe8b886ff82ea..569c7de7c5e44 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>
> [ ... ]
>
> > @@ -13523,17 +13524,17 @@ void bnxt_get_ring_drv_stats(struct bnxt *bp,
> > bnxt_get_one_ring_drv_stats(bp, stats, &bp->bnapi[i]->cp_ring);
> > }
> >
> > -static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask)
> > +static bool bnxt_mc_list_updated(struct bnxt *bp, u32 *rx_mask,
> > + const struct netdev_hw_addr_list *mc)
> > {
> > struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT];
> > - struct net_device *dev = bp->dev;
> > struct netdev_hw_addr *ha;
> > u8 *haddr;
> > int mc_count = 0;
> > bool update = false;
> > int off = 0;
> >
> > - netdev_for_each_mc_addr(ha, dev) {
> > + netdev_hw_addr_list_for_each(ha, mc) {
>
> When iterating over the list, bnxt_mc_list_updated() performs a memcpy()
> into vnic->mc_list.
>
> If this executes asynchronously while bnxt_cfg_rx_mode() is mapping
> vnic->mc_list for DMA to the firmware via bnxt_hwrm_cfa_l2_set_rx_mask(),
> could the NIC hardware read torn or corrupted multicast MAC addresses?
>
> Even if a later patch addresses the explicit list TOCTOU issue, is a
> driver-level spinlock needed here to serialize vnic state access between
> the two concurrent workqueues?
This is not something new I think? The same split was already happening
prior to conversion: bnxt_set_rx_mode updates mc_list and queues
BNXT_RX_MASK_SP_EVENT while bnxt_sp_task does bnxt_cfg_rx_mode.
After the conversion, both paths should be under ops lock, so we should
be in a better place.
>
> [ ... ]
>
> > @@ -15881,7 +15884,7 @@ static const struct net_device_ops bnxt_netdev_ops = {
> > .ndo_start_xmit = bnxt_start_xmit,
> > .ndo_stop = bnxt_close,
> > .ndo_get_stats64 = bnxt_get_stats64,
> > - .ndo_set_rx_mode = bnxt_set_rx_mode,
> > + .ndo_set_rx_mode_async = bnxt_set_rx_mode,
>
> Since this callback now runs asynchronously without the netif_addr_lock_bh()
> that the core networking stack previously held, does this introduce data races
> on the driver state?
>
> With bnxt_set_rx_mode() executing locklessly in the netdev_rx_mode_work
> workqueue, and bnxt_cfg_rx_mode() concurrently modifying vnic->uc_list,
> vnic->uc_filter_count, and vnic->rx_mask in the driver's bnxt_sp_task
> workqueue, can they race when accessing the vnic state?
>
> For example, could lockless assignments to vnic->rx_mask in bnxt_set_rx_mode()
> temporarily drop hardware fallback states (like the PROMISCUOUS bit) set by
> bnxt_cfg_rx_mode()?
Hmm, this is completely bogus? I don't see "lockless assignments". After
async conversion all these paths run fully under ops lock.
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH net-next v5 08/14] bnxt: use snapshot in bnxt_cfg_rx_mode
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (6 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 07/14] bnxt: convert to ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-04 0:06 ` Jakub Kicinski
2026-04-02 22:55 ` [PATCH net-next v5 09/14] iavf: convert to ndo_set_rx_mode_async Stanislav Fomichev
` (5 subsequent siblings)
13 siblings, 1 reply; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, Michael Chan, Pavan Chebbi,
Aleksandr Loktionov
With the introduction of ndo_set_rx_mode_async (as discussed in [1])
we can call bnxt_cfg_rx_mode directly. Convert bnxt_cfg_rx_mode to
use uc/mc snapshots and move its call in bnxt_sp_task to the
section that resets BNXT_STATE_IN_SP_TASK. Switch to direct call in
bnxt_set_rx_mode.
Link: https://lore.kernel.org/netdev/CACKFLi=5vj8hPqEUKDd8RTw3au5G+zRgQEqjF+6NZnyoNm90KA@mail.gmail.com/ [1]
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 33 +++++++++++------------
1 file changed, 15 insertions(+), 18 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 11956c7dde23..48aa9fc53229 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -11032,7 +11032,8 @@ static int bnxt_setup_nitroa0_vnic(struct bnxt *bp)
return rc;
}
-static int bnxt_cfg_rx_mode(struct bnxt *);
+static int bnxt_cfg_rx_mode(struct bnxt *, struct netdev_hw_addr_list *,
+ struct netdev_hw_addr_list *);
static bool bnxt_mc_list_updated(struct bnxt *, u32 *,
const struct netdev_hw_addr_list *);
@@ -11128,7 +11129,7 @@ static int bnxt_init_chip(struct bnxt *bp, bool irq_re_init)
vnic->rx_mask |= mask;
}
- rc = bnxt_cfg_rx_mode(bp);
+ rc = bnxt_cfg_rx_mode(bp, &bp->dev->uc, &bp->dev->mc);
if (rc)
goto err_out;
@@ -13603,24 +13604,17 @@ static void bnxt_set_rx_mode(struct net_device *dev,
if (mask != vnic->rx_mask || uc_update || mc_update) {
vnic->rx_mask = mask;
- bnxt_queue_sp_work(bp, BNXT_RX_MASK_SP_EVENT);
+ bnxt_cfg_rx_mode(bp, uc, mc);
}
}
-static int bnxt_cfg_rx_mode(struct bnxt *bp)
+static int bnxt_cfg_rx_mode(struct bnxt *bp, struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
struct net_device *dev = bp->dev;
struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT];
struct netdev_hw_addr *ha;
int i, off = 0, rc;
- bool uc_update;
-
- netif_addr_lock_bh(dev);
- uc_update = bnxt_uc_list_updated(bp, &dev->uc);
- netif_addr_unlock_bh(dev);
-
- if (!uc_update)
- goto skip_uc;
for (i = 1; i < vnic->uc_filter_count; i++) {
struct bnxt_l2_filter *fltr = vnic->l2_filters[i];
@@ -13632,10 +13626,10 @@ static int bnxt_cfg_rx_mode(struct bnxt *bp)
vnic->uc_filter_count = 1;
netif_addr_lock_bh(dev);
- if (netdev_uc_count(dev) > (BNXT_MAX_UC_ADDRS - 1)) {
+ if (netdev_hw_addr_list_count(uc) > (BNXT_MAX_UC_ADDRS - 1)) {
vnic->rx_mask |= CFA_L2_SET_RX_MASK_REQ_MASK_PROMISCUOUS;
} else {
- netdev_for_each_uc_addr(ha, dev) {
+ netdev_hw_addr_list_for_each(ha, uc) {
memcpy(vnic->uc_list + off, ha->addr, ETH_ALEN);
off += ETH_ALEN;
vnic->uc_filter_count++;
@@ -13662,7 +13656,6 @@ static int bnxt_cfg_rx_mode(struct bnxt *bp)
if (test_and_clear_bit(BNXT_STATE_L2_FILTER_RETRY, &bp->state))
netdev_notice(bp->dev, "Retry of L2 filter configuration successful.\n");
-skip_uc:
if ((vnic->rx_mask & CFA_L2_SET_RX_MASK_REQ_MASK_PROMISCUOUS) &&
!bnxt_promisc_ok(bp))
vnic->rx_mask &= ~CFA_L2_SET_RX_MASK_REQ_MASK_PROMISCUOUS;
@@ -14593,6 +14586,7 @@ static void bnxt_ulp_restart(struct bnxt *bp)
static void bnxt_sp_task(struct work_struct *work)
{
struct bnxt *bp = container_of(work, struct bnxt, sp_task);
+ struct net_device *dev = bp->dev;
set_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
smp_mb__after_atomic();
@@ -14606,9 +14600,6 @@ static void bnxt_sp_task(struct work_struct *work)
bnxt_reenable_sriov(bp);
}
- if (test_and_clear_bit(BNXT_RX_MASK_SP_EVENT, &bp->sp_event))
- bnxt_cfg_rx_mode(bp);
-
if (test_and_clear_bit(BNXT_RX_NTP_FLTR_SP_EVENT, &bp->sp_event))
bnxt_cfg_ntp_filters(bp);
if (test_and_clear_bit(BNXT_HWRM_EXEC_FWD_REQ_SP_EVENT, &bp->sp_event))
@@ -14673,6 +14664,12 @@ static void bnxt_sp_task(struct work_struct *work)
/* These functions below will clear BNXT_STATE_IN_SP_TASK. They
* must be the last functions to be called before exiting.
*/
+ if (test_and_clear_bit(BNXT_RX_MASK_SP_EVENT, &bp->sp_event)) {
+ bnxt_lock_sp(bp);
+ bnxt_cfg_rx_mode(bp, &dev->uc, &dev->mc);
+ bnxt_unlock_sp(bp);
+ }
+
if (test_and_clear_bit(BNXT_RESET_TASK_SP_EVENT, &bp->sp_event))
bnxt_reset(bp, false);
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 08/14] bnxt: use snapshot in bnxt_cfg_rx_mode
2026-04-02 22:55 ` [PATCH net-next v5 08/14] bnxt: use snapshot in bnxt_cfg_rx_mode Stanislav Fomichev
@ 2026-04-04 0:06 ` Jakub Kicinski
2026-04-06 22:29 ` [PATCH net-next v5 08/14] bnxt: use snapshot in bnxt_cfg_rx_mode (74e346419df6) sdf.kernel
0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2026-04-04 0:06 UTC (permalink / raw)
To: sdf; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
bnxt: use snapshot in bnxt_cfg_rx_mode
With the introduction of ndo_set_rx_mode_async we can call bnxt_cfg_rx_mode
directly. Convert bnxt_cfg_rx_mode to use uc/mc snapshots and move its
call in bnxt_sp_task to the section that resets BNXT_STATE_IN_SP_TASK.
Link: https://lore.kernel.org/netdev/CACKFLi=5vj8hPqEUKDd8RTw3au5G+zRgQEqjF+6NZnyoNm90KA@mail.gmail.com/
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
[ ... ]
> -static int bnxt_cfg_rx_mode(struct bnxt *bp)
> +static int bnxt_cfg_rx_mode(struct bnxt *bp, struct netdev_hw_addr_list *uc,
> + struct netdev_hw_addr_list *mc)
This isn't a bug, but is the mc parameter used in this function? The
multicast configuration appears to be handled synchronously prior to this call
by bnxt_mc_list_updated().
> {
> struct net_device *dev = bp->dev;
> struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT];
> struct netdev_hw_addr *ha;
> int i, off = 0, rc;
> - bool uc_update;
> -
> - netif_addr_lock_bh(dev);
> - uc_update = bnxt_uc_list_updated(bp, &dev->uc);
> - netif_addr_unlock_bh(dev);
> -
> - if (!uc_update)
> - goto skip_uc;
By removing this check, will the driver now unconditionally delete and recreate
all secondary hardware unicast MAC filters via HWRM commands on any RX mode
change?
For example, if an unprivileged user joins a multicast group via
IP_ADD_MEMBERSHIP, does this create a time window where the hardware drops
valid unicast traffic destined for secondary MACs?
Also, if any of the firmware commands to re-add a filter fail, will the
function return early and skip programming the RX mask entirely?
[ ... ]
> @@ -14684,6 +14675,12 @@ static void bnxt_sp_task(struct work_struct *work)
> /* These functions below will clear BNXT_STATE_IN_SP_TASK. They
> * must be the last functions to be called before exiting.
> */
> + if (test_and_clear_bit(BNXT_RX_MASK_SP_EVENT, &bp->sp_event)) {
> + bnxt_lock_sp(bp);
Could this introduce a race condition with bnxt_close()?
bnxt_lock_sp() temporarily clears the BNXT_STATE_IN_SP_TASK bit before
blocking on netdev_lock(bp->dev). If bnxt_close() is called concurrently, it
holds netdev_lock() and loops waiting for BNXT_STATE_IN_SP_TASK to clear.
When bnxt_lock_sp() clears the bit, bnxt_close() could unblock, shut down the
device, and free memory including bp->vnic_info via bnxt_free_vnics().
After bnxt_close() completes and releases the lock, bnxt_sp_task() would
acquire it and unconditionally call bnxt_cfg_rx_mode(), which dereferences the
now-freed bp->vnic_info.
Would it be safer to check test_bit(BNXT_STATE_OPEN, &bp->state) after
acquiring the lock, similar to what neighboring handlers do?
> + bnxt_cfg_rx_mode(bp, &dev->uc, &dev->mc);
> + bnxt_unlock_sp(bp);
> + }
^ permalink raw reply [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 08/14] bnxt: use snapshot in bnxt_cfg_rx_mode (74e346419df6)
2026-04-04 0:06 ` Jakub Kicinski
@ 2026-04-06 22:29 ` sdf.kernel
0 siblings, 0 replies; 33+ messages in thread
From: sdf.kernel @ 2026-04-06 22:29 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> bnxt: use snapshot in bnxt_cfg_rx_mode
>
> With the introduction of ndo_set_rx_mode_async we can call bnxt_cfg_rx_mode
> directly. Convert bnxt_cfg_rx_mode to use uc/mc snapshots and move its
> call in bnxt_sp_task to the section that resets BNXT_STATE_IN_SP_TASK.
>
> Link: https://lore.kernel.org/netdev/CACKFLi=5vj8hPqEUKDd8RTw3au5G+zRgQEqjF+6NZnyoNm90KA@mail.gmail.com/
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> [ ... ]
> > -static int bnxt_cfg_rx_mode(struct bnxt *bp)
> > +static int bnxt_cfg_rx_mode(struct bnxt *bp, struct netdev_hw_addr_list *uc,
> > + struct netdev_hw_addr_list *mc)
>
> This isn't a bug, but is the mc parameter used in this function? The
> multicast configuration appears to be handled synchronously prior to this call
> by bnxt_mc_list_updated().
True, mc is fully consumed in bnxt_set_rx_mode (bnxt_mc_list_updated), will
drop.
>
> > {
> > struct net_device *dev = bp->dev;
> > struct bnxt_vnic_info *vnic = &bp->vnic_info[BNXT_VNIC_DEFAULT];
> > struct netdev_hw_addr *ha;
> > int i, off = 0, rc;
> > - bool uc_update;
> > -
> > - netif_addr_lock_bh(dev);
> > - uc_update = bnxt_uc_list_updated(bp, &dev->uc);
> > - netif_addr_unlock_bh(dev);
> > -
> > - if (!uc_update)
> > - goto skip_uc;
>
> By removing this check, will the driver now unconditionally delete and recreate
> all secondary hardware unicast MAC filters via HWRM commands on any RX mode
> change?
>
> For example, if an unprivileged user joins a multicast group via
> IP_ADD_MEMBERSHIP, does this create a time window where the hardware drops
> valid unicast traffic destined for secondary MACs?
>
> Also, if any of the firmware commands to re-add a filter fail, will the
> function return early and skip programming the RX mask entirely?
I don't think this is problematic, but I will restore the original uc_update
check and bring back skip_uc; that will skip unnecessary uc list transactions.
>
> [ ... ]
> > @@ -14684,6 +14675,12 @@ static void bnxt_sp_task(struct work_struct *work)
> > /* These functions below will clear BNXT_STATE_IN_SP_TASK. They
> > * must be the last functions to be called before exiting.
> > */
> > + if (test_and_clear_bit(BNXT_RX_MASK_SP_EVENT, &bp->sp_event)) {
> > + bnxt_lock_sp(bp);
>
> Could this introduce a race condition with bnxt_close()?
>
> bnxt_lock_sp() temporarily clears the BNXT_STATE_IN_SP_TASK bit before
> blocking on netdev_lock(bp->dev). If bnxt_close() is called concurrently, it
> holds netdev_lock() and loops waiting for BNXT_STATE_IN_SP_TASK to clear.
>
> When bnxt_lock_sp() clears the bit, bnxt_close() could unblock, shut down the
> device, and free memory including bp->vnic_info via bnxt_free_vnics().
>
> After bnxt_close() completes and releases the lock, bnxt_sp_task() would
> acquire it and unconditionally call bnxt_cfg_rx_mode(), which dereferences the
> now-freed bp->vnic_info.
>
> Would it be safer to check test_bit(BNXT_STATE_OPEN, &bp->state) after
> acquiring the lock, similar to what neighboring handlers do?
Will do, this seems similar to handling in bnxt_reset.
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH net-next v5 09/14] iavf: convert to ndo_set_rx_mode_async
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (7 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 08/14] bnxt: use snapshot in bnxt_cfg_rx_mode Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-04 0:06 ` Jakub Kicinski
2026-04-02 22:55 ` [PATCH net-next v5 10/14] netdevsim: convert to ndo_set_rx_mode_async Stanislav Fomichev
` (4 subsequent siblings)
13 siblings, 1 reply; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, Tony Nguyen, Przemek Kitszel,
Aleksandr Loktionov
Convert iavf from ndo_set_rx_mode to ndo_set_rx_mode_async.
iavf_set_rx_mode now takes explicit uc/mc list parameters and
uses __hw_addr_sync_dev on the snapshots instead of __dev_uc_sync
and __dev_mc_sync.
The iavf_configure internal caller passes the real lists directly.
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
drivers/net/ethernet/intel/iavf/iavf_main.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index dad001abc908..485acf402a47 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1150,14 +1150,18 @@ bool iavf_promiscuous_mode_changed(struct iavf_adapter *adapter)
/**
* iavf_set_rx_mode - NDO callback to set the netdev filters
* @netdev: network interface device structure
+ * @uc: snapshot of uc address list
+ * @mc: snapshot of mc address list
**/
-static void iavf_set_rx_mode(struct net_device *netdev)
+static void iavf_set_rx_mode(struct net_device *netdev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
struct iavf_adapter *adapter = netdev_priv(netdev);
spin_lock_bh(&adapter->mac_vlan_list_lock);
- __dev_uc_sync(netdev, iavf_addr_sync, iavf_addr_unsync);
- __dev_mc_sync(netdev, iavf_addr_sync, iavf_addr_unsync);
+ __hw_addr_sync_dev(uc, netdev, iavf_addr_sync, iavf_addr_unsync);
+ __hw_addr_sync_dev(mc, netdev, iavf_addr_sync, iavf_addr_unsync);
spin_unlock_bh(&adapter->mac_vlan_list_lock);
spin_lock_bh(&adapter->current_netdev_promisc_flags_lock);
@@ -1210,7 +1214,7 @@ static void iavf_configure(struct iavf_adapter *adapter)
struct net_device *netdev = adapter->netdev;
int i;
- iavf_set_rx_mode(netdev);
+ iavf_set_rx_mode(netdev, &netdev->uc, &netdev->mc);
iavf_configure_tx(adapter);
iavf_configure_rx(adapter);
@@ -5153,7 +5157,7 @@ static const struct net_device_ops iavf_netdev_ops = {
.ndo_open = iavf_open,
.ndo_stop = iavf_close,
.ndo_start_xmit = iavf_xmit_frame,
- .ndo_set_rx_mode = iavf_set_rx_mode,
+ .ndo_set_rx_mode_async = iavf_set_rx_mode,
.ndo_validate_addr = eth_validate_addr,
.ndo_set_mac_address = iavf_set_mac,
.ndo_change_mtu = iavf_change_mtu,
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 09/14] iavf: convert to ndo_set_rx_mode_async
2026-04-02 22:55 ` [PATCH net-next v5 09/14] iavf: convert to ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-04 0:06 ` Jakub Kicinski
2026-04-06 22:29 ` [PATCH net-next v5 09/14] iavf: convert to ndo_set_rx_mode_async (b1dc10d5dff2) sdf.kernel
0 siblings, 1 reply; 33+ messages in thread
From: Jakub Kicinski @ 2026-04-04 0:06 UTC (permalink / raw)
To: sdf; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
iavf: convert to ndo_set_rx_mode_async
This commit converts the iavf driver from using ndo_set_rx_mode to
ndo_set_rx_mode_async. It updates iavf_set_rx_mode to take explicit
unicast and multicast list parameters and uses __hw_addr_sync_dev on
the snapshots instead of __dev_uc_sync and __dev_mc_sync.
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
> --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
> @@ -1150,14 +1150,18 @@ bool iavf_promiscuous_mode_changed(struct iavf_adapter *adapter)
> -static void iavf_set_rx_mode(struct net_device *netdev)
> +static void iavf_set_rx_mode(struct net_device *netdev,
> + struct netdev_hw_addr_list *uc,
> + struct netdev_hw_addr_list *mc)
> {
> struct iavf_adapter *adapter = netdev_priv(netdev);
>
> spin_lock_bh(&adapter->mac_vlan_list_lock);
> - __dev_uc_sync(netdev, iavf_addr_sync, iavf_addr_unsync);
> - __dev_mc_sync(netdev, iavf_addr_sync, iavf_addr_unsync);
> + __hw_addr_sync_dev(uc, netdev, iavf_addr_sync, iavf_addr_unsync);
> + __hw_addr_sync_dev(mc, netdev, iavf_addr_sync, iavf_addr_unsync);
> spin_unlock_bh(&adapter->mac_vlan_list_lock);
[ ... ]
> @@ -1210,7 +1214,7 @@ static void iavf_configure(struct iavf_adapter *adapter)
> struct net_device *netdev = adapter->netdev;
> int i;
>
> - iavf_set_rx_mode(netdev);
> + iavf_set_rx_mode(netdev, &netdev->uc, &netdev->mc);
Does this code iterate and mutate the live netdev address lists without
holding the required locks?
Previously, iavf_set_rx_mode() called __dev_uc_sync() and __dev_mc_sync(),
which internally acquire netif_addr_lock_bh() to protect the live netdev
address lists.
Now, iavf_configure() passes the live lists (&netdev->uc and &netdev->mc)
directly to iavf_set_rx_mode(), which calls __hw_addr_sync_dev() on them.
__hw_addr_sync_dev() iterates and modifies the lists by updating sync_cnt,
refcount, and removing stale entries.
Since iavf_configure() is called from contexts like iavf_open() and the
driver's watchdog task, it appears netif_addr_lock_bh() is not held here.
Could this cause list or RB-tree corruption if there are concurrent
modifications, such as IGMP/MLD timers adding multicast addresses in softirq
while holding netif_addr_lock_bh()?
^ permalink raw reply [flat|nested] 33+ messages in thread* Re: [PATCH net-next v5 09/14] iavf: convert to ndo_set_rx_mode_async (b1dc10d5dff2)
2026-04-04 0:06 ` Jakub Kicinski
@ 2026-04-06 22:29 ` sdf.kernel
0 siblings, 0 replies; 33+ messages in thread
From: sdf.kernel @ 2026-04-06 22:29 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Jakub Kicinski, netdev, davem, edumazet, pabeni, jiri, jv
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> iavf: convert to ndo_set_rx_mode_async
>
> This commit converts the iavf driver from using ndo_set_rx_mode to
> ndo_set_rx_mode_async. It updates iavf_set_rx_mode to take explicit
> unicast and multicast list parameters and uses __hw_addr_sync_dev on
> the snapshots instead of __dev_uc_sync and __dev_mc_sync.
>
> > diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
> > --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
> > +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
> > @@ -1150,14 +1150,18 @@ bool iavf_promiscuous_mode_changed(struct iavf_adapter *adapter)
> > -static void iavf_set_rx_mode(struct net_device *netdev)
> > +static void iavf_set_rx_mode(struct net_device *netdev,
> > + struct netdev_hw_addr_list *uc,
> > + struct netdev_hw_addr_list *mc)
> > {
> > struct iavf_adapter *adapter = netdev_priv(netdev);
> >
> > spin_lock_bh(&adapter->mac_vlan_list_lock);
> > - __dev_uc_sync(netdev, iavf_addr_sync, iavf_addr_unsync);
> > - __dev_mc_sync(netdev, iavf_addr_sync, iavf_addr_unsync);
> > + __hw_addr_sync_dev(uc, netdev, iavf_addr_sync, iavf_addr_unsync);
> > + __hw_addr_sync_dev(mc, netdev, iavf_addr_sync, iavf_addr_unsync);
> > spin_unlock_bh(&adapter->mac_vlan_list_lock);
>
> [ ... ]
>
> > @@ -1210,7 +1214,7 @@ static void iavf_configure(struct iavf_adapter *adapter)
> > struct net_device *netdev = adapter->netdev;
> > int i;
> >
> > - iavf_set_rx_mode(netdev);
> > + iavf_set_rx_mode(netdev, &netdev->uc, &netdev->mc);
>
> Does this code iterate and mutate the live netdev address lists without
> holding the required locks?
>
> Previously, iavf_set_rx_mode() called __dev_uc_sync() and __dev_mc_sync(),
> which internally acquire netif_addr_lock_bh() to protect the live netdev
> address lists.
>
> Now, iavf_configure() passes the live lists (&netdev->uc and &netdev->mc)
> directly to iavf_set_rx_mode(), which calls __hw_addr_sync_dev() on them.
> __hw_addr_sync_dev() iterates and modifies the lists by updating sync_cnt,
> refcount, and removing stale entries.
>
> Since iavf_configure() is called from contexts like iavf_open() and the
> driver's watchdog task, it appears netif_addr_lock_bh() is not held here.
>
> Could this cause list or RB-tree corruption if there are concurrent
> modifications, such as IGMP/MLD timers adding multicast addresses in softirq
> while holding netif_addr_lock_bh()?
Most of this seems bogus:
- ndo_set_rx_mode_async path is fine because of the snapshot
- iavf_configure is fine because of the ops lock
What I can do is wrap the iavf_set_rx_mode call in iavf_configure with
netif_addr_lock_bh to satisfy the bots?
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH net-next v5 10/14] netdevsim: convert to ndo_set_rx_mode_async
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (8 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 09/14] iavf: convert to ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-02 22:55 ` [PATCH net-next v5 11/14] dummy: " Stanislav Fomichev
` (3 subsequent siblings)
13 siblings, 0 replies; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni
Convert netdevsim from ndo_set_rx_mode to ndo_set_rx_mode_async.
The callback is a no-op stub so just update the signature and
ops struct wiring.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
drivers/net/netdevsim/netdev.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index c71b8d116f18..73edc4817d62 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -185,7 +185,9 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
return NETDEV_TX_OK;
}
-static void nsim_set_rx_mode(struct net_device *dev)
+static void nsim_set_rx_mode(struct net_device *dev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
}
@@ -593,7 +595,7 @@ static const struct net_shaper_ops nsim_shaper_ops = {
static const struct net_device_ops nsim_netdev_ops = {
.ndo_start_xmit = nsim_start_xmit,
- .ndo_set_rx_mode = nsim_set_rx_mode,
+ .ndo_set_rx_mode_async = nsim_set_rx_mode,
.ndo_set_mac_address = eth_mac_addr,
.ndo_validate_addr = eth_validate_addr,
.ndo_change_mtu = nsim_change_mtu,
@@ -616,7 +618,7 @@ static const struct net_device_ops nsim_netdev_ops = {
static const struct net_device_ops nsim_vf_netdev_ops = {
.ndo_start_xmit = nsim_start_xmit,
- .ndo_set_rx_mode = nsim_set_rx_mode,
+ .ndo_set_rx_mode_async = nsim_set_rx_mode,
.ndo_set_mac_address = eth_mac_addr,
.ndo_validate_addr = eth_validate_addr,
.ndo_change_mtu = nsim_change_mtu,
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* [PATCH net-next v5 11/14] dummy: convert to ndo_set_rx_mode_async
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (9 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 10/14] netdevsim: convert to ndo_set_rx_mode_async Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-02 22:55 ` [PATCH net-next v5 12/14] net: warn ops-locked drivers still using ndo_set_rx_mode Stanislav Fomichev
` (2 subsequent siblings)
13 siblings, 0 replies; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, Aleksandr Loktionov
Convert dummy driver from ndo_set_rx_mode to ndo_set_rx_mode_async.
The dummy driver's set_multicast_list is a no-op, so the conversion
is straightforward: update the signature and the ops assignment.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
drivers/net/dummy.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index d6bdad4baadd..f8a4eb365c3d 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -47,7 +47,9 @@
static int numdummies = 1;
/* fake multicast ability */
-static void set_multicast_list(struct net_device *dev)
+static void set_multicast_list(struct net_device *dev,
+ struct netdev_hw_addr_list *uc,
+ struct netdev_hw_addr_list *mc)
{
}
@@ -87,7 +89,7 @@ static const struct net_device_ops dummy_netdev_ops = {
.ndo_init = dummy_dev_init,
.ndo_start_xmit = dummy_xmit,
.ndo_validate_addr = eth_validate_addr,
- .ndo_set_rx_mode = set_multicast_list,
+ .ndo_set_rx_mode_async = set_multicast_list,
.ndo_set_mac_address = eth_mac_addr,
.ndo_get_stats64 = dummy_get_stats64,
.ndo_change_carrier = dummy_change_carrier,
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* [PATCH net-next v5 12/14] net: warn ops-locked drivers still using ndo_set_rx_mode
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (10 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 11/14] dummy: " Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-02 22:55 ` [PATCH net-next v5 13/14] selftests: net: add team_bridge_macvlan rx_mode test Stanislav Fomichev
2026-04-02 22:55 ` [PATCH net-next v5 14/14] selftests: net: use ip commands instead of teamd in team " Stanislav Fomichev
13 siblings, 0 replies; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, Aleksandr Loktionov
Now that all in-tree ops-locked drivers have been converted to
ndo_set_rx_mode_async, add a warning in register_netdevice to catch
any remaining or newly added drivers that use ndo_set_rx_mode with
ops locking. This ensures future driver authors are guided toward
the async path.
Also route ops-locked devices through netdev_rx_mode_work even if they
lack rx_mode NDOs, to ensure netdev_ops_assert_locked() does not fire
on the legacy path where only RTNL is held.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
net/core/dev.c | 5 +++++
net/core/dev_addr_lists.c | 3 ++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 3ddf347dcdd7..89e31b1187a9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -11341,6 +11341,11 @@ int register_netdevice(struct net_device *dev)
goto err_uninit;
}
+ if (netdev_need_ops_lock(dev) &&
+ dev->netdev_ops->ndo_set_rx_mode &&
+ !dev->netdev_ops->ndo_set_rx_mode_async)
+ netdev_WARN(dev, "ops-locked drivers should use ndo_set_rx_mode_async\n");
+
ret = netdev_do_alloc_pcpu_stats(dev);
if (ret)
goto err_uninit;
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 2cff791ce374..e2d1d671ea56 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -1365,7 +1365,8 @@ void __dev_set_rx_mode(struct net_device *dev)
if (!netif_device_present(dev))
return;
- if (ops->ndo_set_rx_mode_async || ops->ndo_change_rx_flags) {
+ if (ops->ndo_set_rx_mode_async || ops->ndo_change_rx_flags ||
+ netdev_need_ops_lock(dev)) {
netif_rx_mode_queue(dev);
return;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* [PATCH net-next v5 13/14] selftests: net: add team_bridge_macvlan rx_mode test
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (11 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 12/14] net: warn ops-locked drivers still using ndo_set_rx_mode Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
2026-04-02 22:55 ` [PATCH net-next v5 14/14] selftests: net: use ip commands instead of teamd in team " Stanislav Fomichev
13 siblings, 0 replies; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni
Add a test that exercises the ndo_change_rx_flags path through a
macvlan -> bridge -> team -> dummy stack. This triggers dev_uc_add
under addr_list_lock which flips promiscuity on the lower device.
With the new work queue approach, this must not deadlock.
Link: https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
tools/testing/selftests/net/config | 3 ++
tools/testing/selftests/net/rtnetlink.sh | 44 ++++++++++++++++++++++++
2 files changed, 47 insertions(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 2a390cae41bf..38611ea11c6b 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -101,6 +101,9 @@ CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_SCH_NETEM=y
CONFIG_NET_SCH_PRIO=m
+CONFIG_NET_TEAM=y
+CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=y
+CONFIG_NET_TEAM_MODE_LOADBALANCE=y
CONFIG_NET_VRF=y
CONFIG_NF_CONNTRACK=m
CONFIG_NF_CONNTRACK_OVS=y
diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 5a5ff88321d5..c499953d4885 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -23,6 +23,7 @@ ALL_TESTS="
kci_test_encap
kci_test_macsec
kci_test_macsec_vlan
+ kci_test_team_bridge_macvlan
kci_test_ipsec
kci_test_ipsec_offload
kci_test_fdb_get
@@ -636,6 +637,49 @@ kci_test_macsec_vlan()
end_test "PASS: macsec_vlan"
}
+# Test ndo_change_rx_flags call from dev_uc_add under addr_list_lock spinlock.
+# When we are flipping the promisc, make sure it runs on the work queue.
+#
+# https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/
+# With (more conventional) macvlan instead of macsec.
+# macvlan -> bridge -> team -> dummy
+kci_test_team_bridge_macvlan()
+{
+ local vlan="test_macv1"
+ local bridge="test_br1"
+ local team="test_team1"
+ local dummy="test_dummy1"
+ local ret=0
+
+ run_cmd ip link add $team type team
+ if [ $ret -ne 0 ]; then
+ end_test "SKIP: team_bridge_macvlan: can't add team interface"
+ return $ksft_skip
+ fi
+
+ run_cmd ip link add $dummy type dummy
+ run_cmd ip link set $dummy master $team
+ run_cmd ip link set $team up
+ run_cmd ip link add $bridge type bridge vlan_filtering 1
+ run_cmd ip link set $bridge up
+ run_cmd ip link set $team master $bridge
+ run_cmd ip link add link $bridge name $vlan \
+ address 00:aa:bb:cc:dd:ee type macvlan mode bridge
+ run_cmd ip link set $vlan up
+
+ run_cmd ip link del $vlan
+ run_cmd ip link del $bridge
+ run_cmd ip link del $team
+ run_cmd ip link del $dummy
+
+ if [ $ret -ne 0 ]; then
+ end_test "FAIL: team_bridge_macvlan"
+ return 1
+ fi
+
+ end_test "PASS: team_bridge_macvlan"
+}
+
#-------------------------------------------------------------------
# Example commands
# ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 \
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread* [PATCH net-next v5 14/14] selftests: net: use ip commands instead of teamd in team rx_mode test
2026-04-02 22:55 [PATCH net-next v5 00/14] net: sleepable ndo_set_rx_mode Stanislav Fomichev
` (12 preceding siblings ...)
2026-04-02 22:55 ` [PATCH net-next v5 13/14] selftests: net: add team_bridge_macvlan rx_mode test Stanislav Fomichev
@ 2026-04-02 22:55 ` Stanislav Fomichev
13 siblings, 0 replies; 33+ messages in thread
From: Stanislav Fomichev @ 2026-04-02 22:55 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, Jiri Pirko, Jay Vosburgh
Replace teamd daemon usage with ip link commands for team device
setup. teamd -d daemonizes and returns to the shell before port
addition completes, creating a race: the test may create the macvlan
(and check for its address on a slave) before teamd has finished
adding ports. This makes the test inherently dependent on scheduling
timing.
Using ip commands makes port addition synchronous, removing the race
and making the test deterministic.
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jay Vosburgh <jv@jvosburgh.net>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
---
.../selftests/drivers/net/bonding/lag_lib.sh | 17 +++--------------
.../drivers/net/team/dev_addr_lists.sh | 2 --
2 files changed, 3 insertions(+), 16 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/bonding/lag_lib.sh b/tools/testing/selftests/drivers/net/bonding/lag_lib.sh
index bf9bcd1b5ec0..f2e43b6c4c81 100644
--- a/tools/testing/selftests/drivers/net/bonding/lag_lib.sh
+++ b/tools/testing/selftests/drivers/net/bonding/lag_lib.sh
@@ -23,20 +23,9 @@ test_LAG_cleanup()
ip link set dev dummy2 master "$name"
elif [ "$driver" = "team" ]; then
name="team0"
- teamd -d -c '
- {
- "device": "'"$name"'",
- "runner": {
- "name": "'"$mode"'"
- },
- "ports": {
- "dummy1":
- {},
- "dummy2":
- {}
- }
- }
- '
+ ip link add "$name" type team
+ ip link set dev dummy1 master "$name"
+ ip link set dev dummy2 master "$name"
ip link set dev "$name" up
else
check_err 1
diff --git a/tools/testing/selftests/drivers/net/team/dev_addr_lists.sh b/tools/testing/selftests/drivers/net/team/dev_addr_lists.sh
index b1ec7755b783..26469f3be022 100755
--- a/tools/testing/selftests/drivers/net/team/dev_addr_lists.sh
+++ b/tools/testing/selftests/drivers/net/team/dev_addr_lists.sh
@@ -42,8 +42,6 @@ team_cleanup()
}
-require_command teamd
-
trap cleanup EXIT
tests_run
--
2.53.0
^ permalink raw reply related [flat|nested] 33+ messages in thread