From: Kuniyuki Iwashima <kuniyu@google.com>
To: "David S . Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Simon Horman <horms@kernel.org>,
Kuniyuki Iwashima <kuniyu@google.com>,
Kuniyuki Iwashima <kuni1840@gmail.com>,
netdev@vger.kernel.org
Subject: [PATCH v2 net-next 06/14] net: Add per-netns netdev unregistration infra.
Date: Fri, 3 Jul 2026 00:09:17 +0000 [thread overview]
Message-ID: <20260703001009.1572444-7-kuniyu@google.com> (raw)
In-Reply-To: <20260703001009.1572444-1-kuniyu@google.com>
When we need to unregister a netdev in a different netns, we will
delegate its unregistration to per-netns work.
There are three types of such cross-netns devices:
1. Paired devices (e.g., netkit, veth, vxcan)
-> Unregistering one device also deletes its peer, which
may reside in another netns.
2. Tunnel devices (e.g., bareudp, geneve, etc)
-> Destroying a netns removes devices in another netns if
their backend sockets reside in the dying netns
3. Stacked devices (e.g., ipvlan, macvlan, etc)
-> Removing the lower device also removes multiple upper
devices, each of which may reside in different namespaces.
In these cases, we will use unregister_netdevice_queue_net() to
queue such potential cross-netns devices for destruction.
Each driver must not call both unregister_netdevice_queue_net()
and unregister_netdevice_queue() for the same device. See the
subsequent veth/bareudp/ipvlan patches for how they avoid double
queueing.
unregister_netdevice_queue_net() takes net and dev. If dev resides
in the net, it simply calls unregister_netdevice_queue().
If dev_net(dev) is different from the net, it enqueues the device
to dev_net(dev)->dev_unreg_head and schedules the per-netns work.
When __rtnl_net_unlock() is called from the per-netns work (or another
thread already holding the lock), unregister_netdevice_many_net()
collects the queued devices and calls unregister_netdevice_many()
to perform the actual unregistration.
During netns dismantle, rtnl_net_flush_workqueue() is called at the
end of default_device_exit_batch() to ensure that cross-netns
devices in the other alive netns are unregistered.
Once RTNL is removed, a device could be moved to another netns while
being queued to net->dev_unreg_head.
__dev_change_net_namespace() handles this race by acquiring
net->dev_unreg_lock of both the old and new netns after dev_set_net()
and moving the device between their dev_unreg_head lists.
Since dev_set_net() and unregister_netdevice_queue_net() are
synchronised by netdev_lock(), the device is either queued to the
old netns's dev_unreg_head and then moved, or queued directly to
the new netns.
Note that unregister_netdevice_move_net() does not need to call
rtnl_net_queue_work() because __dev_change_net_namespace() is
(supposed to be) called with rtnl_net_lock(). (Not all callers
hold it yet, but the race does not happen until all callers
are converted and RTNL is removed.)
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
v2:
* Use spin_lock_nested() in unregister_netdevice_move_net()
* Add kdoc for dev->unreg_list_net
* Add DEBUG_NET_WARN_ON_ONCE() in unregister_netdevice_queue()
---
include/linux/netdevice.h | 18 ++++++++
include/net/net_namespace.h | 2 +
net/core/dev.c | 91 +++++++++++++++++++++++++++++++++++++
net/core/net_namespace.c | 2 +
net/core/rtnetlink.c | 4 ++
5 files changed, 117 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9981d637f8b5..108d8d7ea75b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1845,6 +1845,8 @@ enum netdev_reg_state {
* @napi_list: List entry used for polling NAPI devices
* @unreg_list: List entry when we are unregistering the
* device; see the function unregister_netdev
+ * @unreg_list_net:List entry when we are unregistering the cross-netns
+ * device; see the function unregister_netdevice_queue_net()
* @close_list: List entry used when we are closing the device
* @ptype_all: Device-specific packet handlers for all protocols
* @ptype_specific: Device-specific, protocol-specific packet handlers
@@ -2241,6 +2243,9 @@ struct net_device {
struct list_head dev_list;
struct list_head napi_list;
struct list_head unreg_list;
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+ struct list_head unreg_list_net;
+#endif
struct list_head close_list;
struct list_head ptype_all;
@@ -3472,6 +3477,19 @@ static inline void unregister_netdevice(struct net_device *dev)
unregister_netdevice_queue(dev, NULL);
}
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+void unregister_netdevice_queue_net(struct net *net, struct net_device *dev,
+ struct list_head *head);
+void unregister_netdevice_many_net(struct net *net);
+#else
+static inline void unregister_netdevice_queue_net(struct net *net,
+ struct net_device *dev,
+ struct list_head *head)
+{
+ unregister_netdevice_queue(dev, head);
+}
+#endif
+
int netdev_refcnt_read(const struct net_device *dev);
void free_netdev(struct net_device *dev);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index a989019af5f7..501af1999fe8 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -198,6 +198,8 @@ struct net {
/* Move to a better place when the config guard is removed. */
struct mutex rtnl_mutex;
struct work_struct rtnl_work;
+ struct list_head dev_unreg_head;
+ spinlock_t dev_unreg_lock;
#endif
#if IS_ENABLED(CONFIG_VSOCKETS)
struct netns_vsock vsock;
diff --git a/net/core/dev.c b/net/core/dev.c
index 48818a194fa5..ed41c704c941 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -12092,6 +12092,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
INIT_LIST_HEAD(&dev->napi_list);
INIT_LIST_HEAD(&dev->unreg_list);
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+ INIT_LIST_HEAD(&dev->unreg_list_net);
+#endif
INIT_LIST_HEAD(&dev->close_list);
INIT_LIST_HEAD(&dev->link_watch_list);
INIT_LIST_HEAD(&dev->adj_list.upper);
@@ -12309,6 +12312,10 @@ void unregister_netdevice_queue(struct net_device *dev, struct list_head *head)
{
ASSERT_RTNL();
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+ DEBUG_NET_WARN_ON_ONCE(!list_empty(&dev->unreg_list_net));
+#endif
+
if (head) {
list_move_tail(&dev->unreg_list, head);
} else {
@@ -12485,6 +12492,16 @@ void unregister_netdevice_many_notify(struct list_head *head,
synchronize_net();
list_for_each_entry(dev, head, unreg_list) {
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+ struct net *net = dev_net(dev);
+
+ /* spin_lock() can be moved outside of the loop
+ * once the per-netns RTNL conversion completes.
+ */
+ spin_lock(&net->dev_unreg_lock);
+ list_del(&dev->unreg_list_net);
+ spin_unlock(&net->dev_unreg_lock);
+#endif
netdev_put(dev, &dev->dev_registered_tracker);
net_set_todo(dev);
cnt++;
@@ -12507,6 +12524,74 @@ void unregister_netdevice_many(struct list_head *head)
}
EXPORT_SYMBOL(unregister_netdevice_many);
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+void unregister_netdevice_queue_net(struct net *net, struct net_device *dev,
+ struct list_head *head)
+{
+ netdev_lock(dev);
+
+ if (net_eq(dev_net(dev), net)) {
+ netdev_unlock(dev);
+ unregister_netdevice_queue(dev, head);
+ return;
+ }
+
+ net = dev_net(dev);
+
+ spin_lock(&net->dev_unreg_lock);
+
+ DEBUG_NET_WARN_ON_ONCE(!list_empty(&dev->unreg_list));
+ DEBUG_NET_WARN_ON_ONCE(!list_empty(&dev->unreg_list_net));
+
+ list_add_tail(&dev->unreg_list_net, &net->dev_unreg_head);
+ rtnl_net_queue_work(net);
+
+ spin_unlock(&net->dev_unreg_lock);
+
+ netdev_unlock(dev);
+}
+EXPORT_SYMBOL(unregister_netdevice_queue_net);
+
+static void unregister_netdevice_move_net(struct net *net_old,
+ struct net *net,
+ struct net_device *dev)
+{
+ if (net_old > net) {
+ spin_lock(&net->dev_unreg_lock);
+ spin_lock_nested(&net_old->dev_unreg_lock, SINGLE_DEPTH_NESTING);
+ } else {
+ spin_lock(&net_old->dev_unreg_lock);
+ spin_lock_nested(&net->dev_unreg_lock, SINGLE_DEPTH_NESTING);
+ }
+
+ if (!list_empty(&dev->unreg_list_net)) {
+ list_del(&dev->unreg_list_net);
+ list_add_tail(&dev->unreg_list_net, &net->dev_unreg_head);
+ }
+
+ spin_unlock(&net_old->dev_unreg_lock);
+ spin_unlock(&net->dev_unreg_lock);
+}
+
+void unregister_netdevice_many_net(struct net *net)
+{
+ struct net_device *dev, *tmp;
+ LIST_HEAD(unreg_head_net);
+ LIST_HEAD(unreg_head);
+
+ spin_lock(&net->dev_unreg_lock);
+ list_splice_init(&net->dev_unreg_head, &unreg_head_net);
+ spin_unlock(&net->dev_unreg_lock);
+
+ list_for_each_entry_safe(dev, tmp, &unreg_head_net, unreg_list_net) {
+ list_del_init(&dev->unreg_list_net);
+ list_add_tail(&dev->unreg_list, &unreg_head);
+ }
+
+ unregister_netdevice_many(&unreg_head);
+}
+#endif
+
/**
* unregister_netdev - remove device from the kernel
* @dev: device
@@ -12663,6 +12748,10 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
netdev_unlock(dev);
dev->ifindex = new_ifindex;
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+ unregister_netdevice_move_net(net_old, net, dev);
+#endif
+
if (new_name[0]) {
/* Rename the netdev to prepared name */
write_seqlock_bh(&netdev_rename_lock);
@@ -13105,6 +13194,8 @@ static void __net_exit default_device_exit_batch(struct list_head *net_list)
}
unregister_netdevice_many(&dev_kill_list);
rtnl_unlock();
+
+ rtnl_net_flush_workqueue();
}
static struct pernet_operations __net_initdata default_device_ops = {
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index d1aeff9de580..578b48cf5318 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -423,6 +423,8 @@ static __net_init int preinit_net(struct net *net, struct user_namespace *user_n
mutex_init(&net->rtnl_mutex);
lock_set_cmp_fn(&net->rtnl_mutex, rtnl_net_lock_cmp_fn, NULL);
INIT_WORK(&net->rtnl_work, rtnl_net_work_func);
+ INIT_LIST_HEAD(&net->dev_unreg_head);
+ spin_lock_init(&net->dev_unreg_lock);
#endif
INIT_LIST_HEAD(&net->ptype_all);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 7959519e7375..544498d3c325 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -197,6 +197,7 @@ void __rtnl_net_unlock(struct net *net)
{
ASSERT_RTNL();
+ unregister_netdevice_many_net(net);
mutex_unlock(&net->rtnl_mutex);
}
EXPORT_SYMBOL(__rtnl_net_unlock);
@@ -290,6 +291,9 @@ void rtnl_net_work_func(struct work_struct *work)
{
struct net *net = container_of(work, struct net, rtnl_work);
+ if (list_empty(&net->dev_unreg_head))
+ return;
+
rtnl_net_lock(net);
rtnl_net_unlock(net);
}
--
2.55.0.rc0.799.gd6f94ed593-goog
next prev parent reply other threads:[~2026-07-03 0:10 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-07-03 0:09 [PATCH v2 net-next 00/14] net: Support per-netns device unregistration Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 01/14] rtnetlink: Lock sock_net(skb->sk) in rtnl_newlink() Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 02/14] rtnetlink: Call unregister_netdevice_many() only once in rtnl_link_unregister() Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 03/14] rtnetlink: Add per-netns rtnl_work Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 04/14] net: Wrap default_device_exit_net() with __rtnl_net_lock() Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 05/14] net: Hold __rtnl_net_lock() in netdev_wait_allrefs_any() Kuniyuki Iwashima
2026-07-03 0:09 ` Kuniyuki Iwashima [this message]
2026-07-03 0:09 ` [PATCH v2 net-next 07/14] net: Call unregister_netdevice_many() per netns Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 08/14] veth: Support per-netns device unregistration Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 09/14] bareudp: Protect bareudp_list with mutex Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 10/14] bareudp: Support per-netns netdev unregistration Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 11/14] ipvlan: Convert ipvl_port.count to refcount_t Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 12/14] ipvlan: Synchronise ipvlan_init() and ipvlan_uninit() for the same lower dev Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 13/14] ipvlan: Protect ipvl_port.ipvlans with mutex Kuniyuki Iwashima
2026-07-03 0:09 ` [PATCH v2 net-next 14/14] ipvlan: Support per-netns netdev unregistration Kuniyuki Iwashima
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260703001009.1572444-7-kuniyu@google.com \
--to=kuniyu@google.com \
--cc=andrew+netdev@lunn.ch \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=horms@kernel.org \
--cc=kuba@kernel.org \
--cc=kuni1840@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox