* [PATCH v1 net-next 06/14] net: Add per-netns netdev unregistration infra.
From: Kuniyuki Iwashima @ 2026-07-01 21:41 UTC
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260701214334.266991-1-kuniyu@google.com>
When we need to unregister a netdev in a different netns, we will
delegate its unregistration to per-netns work.
There are three types of such cross-netns devices:
1. Paired devices (e.g., netkit, veth, vxcan)
-> Unregistering one device also deletes its peer, which
may reside in another netns.
2. Tunnel devices (e.g., bareudp, geneve)
-> Destroying a netns removes devices in another netns if
their backend sockets reside in the dying netns.
3. Stacked devices (e.g., ipvlan, macvlan)
-> Removing the lower device also removes multiple upper
devices, each of which may reside in different namespaces.
In these cases, we will use unregister_netdevice_queue_net() to
queue such potential cross-netns devices for destruction.
unregister_netdevice_queue_net() takes net and dev. If dev resides
in the net, it simply calls unregister_netdevice_queue().
If dev_net(dev) is different from the net, it enqueues the device
to dev_net(dev)->dev_unreg_head and schedules the per-netns work.
When __rtnl_net_unlock() is called from the per-netns work (or another
thread already holding the lock), unregister_netdevice_many_net()
collects the queued devices and calls unregister_netdevice_many()
to perform the actual unregistration.
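The queue-or-defer decision described above can be sketched in userspace C. This is only a toy model under stated assumptions: struct toy_net, struct toy_dev, toy_queue_net() and toy_many_net() are hypothetical stand-ins (not kernel APIs), a fixed array replaces net->dev_unreg_head, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct net and struct net_device; not the kernel types. */
#define MAX_QUEUED 8

struct toy_dev;

struct toy_net {
	struct toy_dev *unreg_queue[MAX_QUEUED]; /* models net->dev_unreg_head */
	size_t nr_queued;
	bool work_scheduled;                     /* models rtnl_net_queue_work() */
};

struct toy_dev {
	struct toy_net *net;   /* models dev_net(dev) */
	bool unregistered;
};

/* Models unregister_netdevice_queue_net(): returns true if the device was
 * handled by the caller, false if it was deferred to its own netns.
 */
static bool toy_queue_net(struct toy_net *net, struct toy_dev *dev)
{
	struct toy_net *owner = dev->net;

	if (owner == net) {
		dev->unregistered = true;   /* same netns: no delegation needed */
		return true;
	}

	assert(owner->nr_queued < MAX_QUEUED);
	owner->unreg_queue[owner->nr_queued++] = dev;
	owner->work_scheduled = true;
	return false;
}

/* Models unregister_netdevice_many_net() as run from __rtnl_net_unlock():
 * drains the per-netns queue and returns how many devices were handled.
 */
static size_t toy_many_net(struct toy_net *net)
{
	size_t i, n = net->nr_queued;

	for (i = 0; i < n; i++)
		net->unreg_queue[i]->unregistered = true;

	net->nr_queued = 0;
	return n;
}
```

A device in the caller's netns is handled immediately, while a device in another netns stays registered until that netns drains its own queue.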
During netns dismantle, rtnl_net_flush_workqueue() is called at the
end of default_device_exit_batch() to ensure that cross-netns
devices in other, still-alive netns are unregistered.
Once RTNL is removed, a device could be moved to another netns while
being queued to net->dev_unreg_head.
__dev_change_net_namespace() handles this race by acquiring
net->dev_unreg_lock of both the old and new netns after dev_set_net()
and moving the device between their dev_unreg_head lists.
Since dev_set_net() and unregister_netdevice_queue_net() are
synchronised by netdev_lock(), the device is either queued to the
old netns's dev_unreg_head and then moved, or queued directly to
the new netns.
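For illustration, the address-ordered double locking used when moving a queued device between two netns can be modelled as below. This is a sketch, not the kernel code: toy_acquire()/toy_release() are hypothetical single-threaded stand-ins for spin_lock()/spin_unlock(), and a counter replaces each dev_unreg_head list. The two netns are assumed to be distinct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_lock { int held; };

struct toy_net {
	struct toy_lock unreg_lock;  /* models net->dev_unreg_lock */
	int nr_queued;               /* devices on this netns's queue */
};

struct toy_dev { int queued; };      /* non-zero if on some queue */

static void toy_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_release(struct toy_lock *l) { assert(l->held);  l->held = 0; }

/* The lock of the netns at the lower address is always taken first, so two
 * concurrent movers in opposite directions cannot deadlock (ABBA).
 */
static struct toy_net *toy_first(struct toy_net *a, struct toy_net *b)
{
	return (uintptr_t)a < (uintptr_t)b ? a : b;
}

/* Models unregister_netdevice_move_net(); net_old and net must differ. */
static void toy_move_net(struct toy_net *net_old, struct toy_net *net,
			 struct toy_dev *dev)
{
	struct toy_net *first = toy_first(net_old, net);
	struct toy_net *second = first == net_old ? net : net_old;

	toy_acquire(&first->unreg_lock);
	toy_acquire(&second->unreg_lock);

	if (dev->queued) {           /* only move devices already queued */
		net_old->nr_queued--;
		net->nr_queued++;
	}

	toy_release(&second->unreg_lock);
	toy_release(&first->unreg_lock);
}
```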
Note that unregister_netdevice_move_net() does not need to call
rtnl_net_queue_work() because __dev_change_net_namespace() is
(supposed to be) called with rtnl_net_lock(). (Not all callers
hold it yet, but the race does not happen until all callers
are converted and RTNL is removed.)
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/linux/netdevice.h | 16 +++++++
include/net/net_namespace.h | 2 +
net/core/dev.c | 85 +++++++++++++++++++++++++++++++++++++
net/core/net_namespace.c | 2 +
net/core/rtnetlink.c | 4 ++
5 files changed, 109 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9981d637f8b5..53454db3611a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2241,6 +2241,9 @@ struct net_device {
struct list_head dev_list;
struct list_head napi_list;
struct list_head unreg_list;
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+ struct list_head unreg_list_net;
+#endif
struct list_head close_list;
struct list_head ptype_all;
@@ -3472,6 +3475,19 @@ static inline void unregister_netdevice(struct net_device *dev)
unregister_netdevice_queue(dev, NULL);
}
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+void unregister_netdevice_queue_net(struct net *net, struct net_device *dev,
+ struct list_head *head);
+void unregister_netdevice_many_net(struct net *net);
+#else
+static inline void unregister_netdevice_queue_net(struct net *net,
+ struct net_device *dev,
+ struct list_head *head)
+{
+ unregister_netdevice_queue(dev, head);
+}
+#endif
+
int netdev_refcnt_read(const struct net_device *dev);
void free_netdev(struct net_device *dev);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index a989019af5f7..501af1999fe8 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -198,6 +198,8 @@ struct net {
/* Move to a better place when the config guard is removed. */
struct mutex rtnl_mutex;
struct work_struct rtnl_work;
+ struct list_head dev_unreg_head;
+ spinlock_t dev_unreg_lock;
#endif
#if IS_ENABLED(CONFIG_VSOCKETS)
struct netns_vsock vsock;
diff --git a/net/core/dev.c b/net/core/dev.c
index 48818a194fa5..0f0bf65f5bf9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -12092,6 +12092,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
INIT_LIST_HEAD(&dev->napi_list);
INIT_LIST_HEAD(&dev->unreg_list);
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+ INIT_LIST_HEAD(&dev->unreg_list_net);
+#endif
INIT_LIST_HEAD(&dev->close_list);
INIT_LIST_HEAD(&dev->link_watch_list);
INIT_LIST_HEAD(&dev->adj_list.upper);
@@ -12485,6 +12488,16 @@ void unregister_netdevice_many_notify(struct list_head *head,
synchronize_net();
list_for_each_entry(dev, head, unreg_list) {
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+ struct net *net = dev_net(dev);
+
+ /* spin_lock() can be moved outside of the loop
+ * once the per-netns RTNL conversion completes.
+ */
+ spin_lock(&net->dev_unreg_lock);
+ list_del(&dev->unreg_list_net);
+ spin_unlock(&net->dev_unreg_lock);
+#endif
netdev_put(dev, &dev->dev_registered_tracker);
net_set_todo(dev);
cnt++;
@@ -12507,6 +12520,72 @@ void unregister_netdevice_many(struct list_head *head)
}
EXPORT_SYMBOL(unregister_netdevice_many);
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+void unregister_netdevice_queue_net(struct net *net, struct net_device *dev,
+ struct list_head *head)
+{
+ netdev_lock(dev);
+
+ if (net_eq(dev_net(dev), net)) {
+ netdev_unlock(dev);
+ unregister_netdevice_queue(dev, head);
+ return;
+ }
+
+ net = dev_net(dev);
+
+ spin_lock(&net->dev_unreg_lock);
+
+ DEBUG_NET_WARN_ON_ONCE(!list_empty(&dev->unreg_list_net));
+ list_add_tail(&dev->unreg_list_net, &net->dev_unreg_head);
+ rtnl_net_queue_work(net);
+
+ spin_unlock(&net->dev_unreg_lock);
+
+ netdev_unlock(dev);
+}
+EXPORT_SYMBOL(unregister_netdevice_queue_net);
+
+static void unregister_netdevice_move_net(struct net *net_old,
+ struct net *net,
+ struct net_device *dev)
+{
+ if (net_old > net) {
+ spin_lock(&net->dev_unreg_lock);
+ spin_lock(&net_old->dev_unreg_lock);
+ } else {
+ spin_lock(&net_old->dev_unreg_lock);
+ spin_lock(&net->dev_unreg_lock);
+ }
+
+ if (!list_empty(&dev->unreg_list_net)) {
+ list_del(&dev->unreg_list_net);
+ list_add_tail(&dev->unreg_list_net, &net->dev_unreg_head);
+ }
+
+ spin_unlock(&net_old->dev_unreg_lock);
+ spin_unlock(&net->dev_unreg_lock);
+}
+
+void unregister_netdevice_many_net(struct net *net)
+{
+ struct net_device *dev, *tmp;
+ LIST_HEAD(unreg_head_net);
+ LIST_HEAD(unreg_head);
+
+ spin_lock(&net->dev_unreg_lock);
+ list_splice_init(&net->dev_unreg_head, &unreg_head_net);
+ spin_unlock(&net->dev_unreg_lock);
+
+ list_for_each_entry_safe(dev, tmp, &unreg_head_net, unreg_list_net) {
+ list_del_init(&dev->unreg_list_net);
+ list_add_tail(&dev->unreg_list, &unreg_head);
+ }
+
+ unregister_netdevice_many(&unreg_head);
+}
+#endif
+
/**
* unregister_netdev - remove device from the kernel
* @dev: device
@@ -12663,6 +12742,10 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
netdev_unlock(dev);
dev->ifindex = new_ifindex;
+#ifdef CONFIG_DEBUG_NET_SMALL_RTNL
+ unregister_netdevice_move_net(net_old, net, dev);
+#endif
+
if (new_name[0]) {
/* Rename the netdev to prepared name */
write_seqlock_bh(&netdev_rename_lock);
@@ -13105,6 +13188,8 @@ static void __net_exit default_device_exit_batch(struct list_head *net_list)
}
unregister_netdevice_many(&dev_kill_list);
rtnl_unlock();
+
+ rtnl_net_flush_workqueue();
}
static struct pernet_operations __net_initdata default_device_ops = {
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index d1aeff9de580..578b48cf5318 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -423,6 +423,8 @@ static __net_init int preinit_net(struct net *net, struct user_namespace *user_n
mutex_init(&net->rtnl_mutex);
lock_set_cmp_fn(&net->rtnl_mutex, rtnl_net_lock_cmp_fn, NULL);
INIT_WORK(&net->rtnl_work, rtnl_net_work_func);
+ INIT_LIST_HEAD(&net->dev_unreg_head);
+ spin_lock_init(&net->dev_unreg_lock);
#endif
INIT_LIST_HEAD(&net->ptype_all);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 7959519e7375..544498d3c325 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -197,6 +197,7 @@ void __rtnl_net_unlock(struct net *net)
{
ASSERT_RTNL();
+ unregister_netdevice_many_net(net);
mutex_unlock(&net->rtnl_mutex);
}
EXPORT_SYMBOL(__rtnl_net_unlock);
@@ -290,6 +291,9 @@ void rtnl_net_work_func(struct work_struct *work)
{
struct net *net = container_of(work, struct net, rtnl_work);
+ if (list_empty(&net->dev_unreg_head))
+ return;
+
rtnl_net_lock(net);
rtnl_net_unlock(net);
}
--
2.55.0.rc0.799.gd6f94ed593-goog
* [PATCH v1 net-next 07/14] net: Call unregister_netdevice_many() per netns.
From: Kuniyuki Iwashima @ 2026-07-01 21:41 UTC
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260701214334.266991-1-kuniyu@google.com>
For per-netns device unregistration, the list passed to
unregister_netdevice_many() must contain devices from a single
netns only (once all callers are converted).
Let's move collected devices in the following functions to
net->dev_unreg_head and let __rtnl_net_unlock() pass them to
unregister_netdevice_many().
* default_device_exit_batch()
* ops_exit_rtnl_list()
* __rtnl_kill_links()
This allows incremental conversion of each driver to support
per-netns device unregistration without affecting the normal
kernel where CONFIG_DEBUG_NET_SMALL_RTNL is disabled.
Note that this change unbatches synchronize_rcu() and similar calls in
unregister_netdevice_many(), but we can later split it into
multiple stages to batch them again.
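As a rough userspace illustration of the per-netns split, the helper below moves only the entries that belong to one netns out of a mixed kill list and leaves the rest for the legacy path. Everything here is hypothetical: toy_queue_many_net() and the array-based lists are stand-ins chosen for brevity, and the spinlock is not modelled.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev { int netns_id; };    /* models dev_net(dev) as a plain id */

/* Moves every entry of kill[] that belongs to netns_id into queue[] and
 * compacts the remaining entries in place. Returns the number of entries
 * moved; *nr_kill is updated to the number of entries left behind.
 */
static size_t toy_queue_many_net(int netns_id, struct toy_dev **kill,
				 size_t *nr_kill, struct toy_dev **queue)
{
	size_t i, kept = 0, moved = 0;

	for (i = 0; i < *nr_kill; i++) {
		if (kill[i]->netns_id == netns_id)
			queue[moved++] = kill[i];  /* goes to the per-netns queue */
		else
			kill[kept++] = kill[i];    /* stays on the shared list */
	}

	*nr_kill = kept;
	return moved;
}
```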
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/linux/netdevice.h | 6 ++++++
net/core/dev.c | 27 +++++++++++++++++++++++++++
net/core/net_namespace.c | 1 +
net/core/rtnetlink.c | 6 +++++-
4 files changed, 39 insertions(+), 1 deletion(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 53454db3611a..0cd26fb59806 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3479,6 +3479,7 @@ static inline void unregister_netdevice(struct net_device *dev)
void unregister_netdevice_queue_net(struct net *net, struct net_device *dev,
struct list_head *head);
void unregister_netdevice_many_net(struct net *net);
+void unregister_netdevice_queue_many_net(struct net *net, struct list_head *head);
#else
static inline void unregister_netdevice_queue_net(struct net *net,
struct net_device *dev,
@@ -3486,6 +3487,11 @@ static inline void unregister_netdevice_queue_net(struct net *net,
{
unregister_netdevice_queue(dev, head);
}
+
+static inline void unregister_netdevice_queue_many_net(struct net *net,
+ struct list_head *head)
+{
+}
#endif
int netdev_refcnt_read(const struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index 0f0bf65f5bf9..57fb4741d0ac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -12546,6 +12546,28 @@ void unregister_netdevice_queue_net(struct net *net, struct net_device *dev,
}
EXPORT_SYMBOL(unregister_netdevice_queue_net);
+void unregister_netdevice_queue_many_net(struct net *net, struct list_head *head)
+{
+ struct net_device *dev, *tmp;
+
+ spin_lock(&net->dev_unreg_lock);
+ list_for_each_entry_safe(dev, tmp, head, unreg_list) {
+ /* Once all cross-netns unregister_netdevice_queue() is
+ * converted to _net() (or for debugging), remove this check.
+ */
+ if (!net_eq(dev_net(dev), net))
+ continue;
+
+ DEBUG_NET_WARN_ONCE(!net_eq(dev_net(dev), net),
+ "%s was unregistered from a different netns.\n",
+ dev->name);
+
+ list_del_init(&dev->unreg_list);
+ list_move_tail(&dev->unreg_list_net, &net->dev_unreg_head);
+ }
+ spin_unlock(&net->dev_unreg_lock);
+}
+
static void unregister_netdevice_move_net(struct net *net_old,
struct net *net,
struct net_device *dev)
@@ -13179,12 +13201,17 @@ static void __net_exit default_device_exit_batch(struct list_head *net_list)
__rtnl_net_unlock(&init_net);
list_for_each_entry(net, net_list, exit_list) {
+ __rtnl_net_lock(net);
+
for_each_netdev_reverse(net, dev) {
if (dev->rtnl_link_ops && dev->rtnl_link_ops->dellink)
dev->rtnl_link_ops->dellink(dev, &dev_kill_list);
else
unregister_netdevice_queue(dev, &dev_kill_list);
}
+
+ unregister_netdevice_queue_many_net(net, &dev_kill_list);
+ __rtnl_net_unlock(net);
}
unregister_netdevice_many(&dev_kill_list);
rtnl_unlock();
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 578b48cf5318..a91d2b58aadd 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -181,6 +181,7 @@ static void ops_exit_rtnl_list(const struct list_head *ops_list,
ops->exit_rtnl(net, &dev_kill_list);
}
+ unregister_netdevice_queue_many_net(net, &dev_kill_list);
__rtnl_net_unlock(net);
}
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 544498d3c325..b129f793d851 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -714,8 +714,12 @@ void rtnl_link_unregister(struct rtnl_link_ops *ops)
down_write(&pernet_ops_rwsem);
rtnl_lock_unregistering_all();
- for_each_net(net)
+ for_each_net(net) {
+ __rtnl_net_lock(net);
__rtnl_kill_links(net, ops, &dev_kill_list);
+ unregister_netdevice_queue_many_net(net, &dev_kill_list);
+ __rtnl_net_unlock(net);
+ }
unregister_netdevice_many(&dev_kill_list);
--
2.55.0.rc0.799.gd6f94ed593-goog
* [PATCH v1 net-next 08/14] veth: Support per-netns device unregistration.
From: Kuniyuki Iwashima @ 2026-07-01 21:41 UTC
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260701214334.266991-1-kuniyu@google.com>
Currently, veth_dellink() unregisters both local and peer devices
synchronously under RTNL.
Once RTNL is removed, it can be called concurrently from different
netns.
Let's use xchg() and unregister_netdevice_queue_net() to support
per-netns device unregistration.
This way, each device is queued for destruction only once by
the winner of the race.
Note that the extra netdev_hold() ensures that @peer obtained by
the first xchg() is not freed during the subsequent access to
netdev_priv(peer). The 2nd xchg() overwrites @dev to balance
the refcount.
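The "winner of the race" idea can be sketched with C11 atomics, where atomic_exchange() plays the role of the kernel's xchg(). This is a simplified model under assumptions: struct toy_veth and toy_dellink() are hypothetical, a counter stands in for queueing the device, and the netdev_hold()/netdev_put() bookkeeping is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_veth {
	_Atomic(struct toy_veth *) peer;  /* models veth_priv.peer */
	int queued;                       /* times queued for unregistration */
};

/* Models the claim step in veth_dellink(): whoever swaps a non-NULL pointer
 * out first is the only one allowed to queue the device it points away from.
 */
static void toy_dellink(struct toy_veth *dev)
{
	struct toy_veth *peer = atomic_exchange(&dev->peer, NULL);

	if (!peer)
		return;         /* another caller already claimed this device */

	dev->queued++;

	/* Queue the peer only if nobody has claimed it from its own side. */
	if (atomic_exchange(&peer->peer, NULL))
		peer->queued++;
}
```

However the calls on the two ends interleave, each pointer can be swapped from non-NULL to NULL exactly once, so each device is queued at most once.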
Tested:
1. Create two veth pairs (veth1-2, veth3-4) between two netns
(ns1 & ns2).
# ip netns add ns1
# ip netns add ns2
# ip -n ns1 link add veth1 type veth peer veth2 netns ns2
# ip -n ns1 link add veth3 type veth peer veth4 netns ns2
2. Run bpftrace to check if the same process does NOT
unregister the paired veth devices
# bpftrace -e '#include <linux/netdevice.h>
kprobe:free_netdev {
$dev = (struct net_device *)arg0;
printf("PID: %d | DEV: %s%s\n", pid, $dev->name, kstack());
}'
3. Remove veth2 in ns2 and check bpftrace output
# ip -n ns2 link del veth2
PID: 2194 | DEV: veth2
free_netdev+5
netdev_run_todo+4798
rtnl_dellink+1507
rtnetlink_rcv_msg+1791
...
PID: 448 | DEV: veth1
free_netdev+5
netdev_run_todo+4798
process_scheduled_works+2538
...
4. Remove ns2 (thus veth4) and check bpftrace output
# ip netns del ns2
PID: 571 | DEV: veth4
free_netdev+5
netdev_run_todo+4798
default_device_exit_batch+2271
ops_undo_list+993
cleanup_net+1122
process_scheduled_works+2538
...
PID: 441 | DEV: veth3
free_netdev+5
netdev_run_todo+4798
process_scheduled_works+2538
...
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
drivers/net/veth.c | 34 +++++++++++++++++++++-------------
1 file changed, 21 insertions(+), 13 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 1c5142149175..8170bf33ccf9 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -77,6 +77,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
unsigned int requested_headroom;
+ netdevice_tracker peer_tracker;
};
struct veth_xdp_tx_bq {
@@ -1901,15 +1902,17 @@ static int veth_newlink(struct net_device *dev,
priv = netdev_priv(dev);
rcu_assign_pointer(priv->peer, peer);
+ netdev_hold(peer, &priv->peer_tracker, GFP_KERNEL);
err = veth_init_queues(dev, tb);
if (err)
goto err_queues;
priv = netdev_priv(peer);
rcu_assign_pointer(priv->peer, dev);
+ netdev_hold(dev, &priv->peer_tracker, GFP_KERNEL);
err = veth_init_queues(peer, tb);
if (err)
- goto err_queues;
+ goto err_peer_queues;
veth_disable_gro(dev);
/* update XDP supported features */
@@ -1918,7 +1921,11 @@ static int veth_newlink(struct net_device *dev,
return 0;
+err_peer_queues:
+ netdev_put(dev, &priv->peer_tracker);
+ priv = netdev_priv(dev);
err_queues:
+ netdev_put(peer, &priv->peer_tracker);
unregister_netdevice(dev);
err_register_dev:
/* nothing to do */
@@ -1933,24 +1940,25 @@ static int veth_newlink(struct net_device *dev,
static void veth_dellink(struct net_device *dev, struct list_head *head)
{
- struct veth_priv *priv;
+ netdevice_tracker *peer_tracker;
struct net_device *peer;
+ struct veth_priv *priv;
priv = netdev_priv(dev);
- peer = rtnl_dereference(priv->peer);
+ peer_tracker = &priv->peer_tracker;
+ peer = unrcu_pointer(xchg(&priv->peer, NULL));
+ if (!peer)
+ return;
- /* Note : dellink() is called from default_device_exit_batch(),
- * before a rcu_synchronize() point. The devices are guaranteed
- * not being freed before one RCU grace period.
- */
- RCU_INIT_POINTER(priv->peer, NULL);
unregister_netdevice_queue(dev, head);
- if (peer) {
- priv = netdev_priv(peer);
- RCU_INIT_POINTER(priv->peer, NULL);
- unregister_netdevice_queue(peer, head);
- }
+ priv = netdev_priv(peer);
+ dev = unrcu_pointer(xchg(&priv->peer, NULL));
+ if (dev)
+ unregister_netdevice_queue_net(dev_net(dev), peer, head);
+
+ netdev_put(peer, peer_tracker);
+ netdev_put(dev, &priv->peer_tracker);
}
static const struct nla_policy veth_policy[VETH_INFO_MAX + 1] = {
--
2.55.0.rc0.799.gd6f94ed593-goog
* [PATCH v1 net-next 10/14] bareudp: Support per-netns netdev unregistration.
From: Kuniyuki Iwashima @ 2026-07-01 21:41 UTC
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260701214334.266991-1-kuniyu@google.com>
bareudp_exit_rtnl_net() iterates bareudp devices whose sockets
are in the dying netns and queues them for destruction.
So the devices may reside in different netns.
Let's use unregister_netdevice_queue_net() to support per-netns
device unregistration.
list_del() is changed to list_del_init() to avoid queueing the
same device twice.
Even after bareudp_exit_rtnl_net() queues a cross-netns bareudp
device, bareudp_dellink() could be called concurrently for it
(once RTNL is removed). In such a case, __rtnl_net_unlock() will
perform the unregistration.
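The list_del_init() plus list_empty() guard can be illustrated with a minimal userspace list. The helpers below are hypothetical re-implementations that mimic the semantics of the kernel's list primitives, and toy_dellink() is a simplified model of the guarded bareudp_dellink() without the mutex.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_list { struct toy_list *next, *prev; };

static void toy_list_init(struct toy_list *n) { n->next = n->prev = n; }
static bool toy_list_empty(const struct toy_list *n) { return n->next == n; }

static void toy_list_add(struct toy_list *n, struct toy_list *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink the node and point it back at itself, so that a later
 * toy_list_empty(n) reports that it is no longer on any list.
 */
static void toy_list_del_init(struct toy_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	toy_list_init(n);
}

struct toy_bareudp {
	struct toy_list next;   /* models bareudp_dev.next */
	int queued;             /* times queued for unregistration */
};

/* Only the first caller finds the node linked and queues the device. */
static bool toy_dellink(struct toy_bareudp *b)
{
	if (toy_list_empty(&b->next))
		return false;

	toy_list_del_init(&b->next);
	b->queued++;
	return true;
}
```

With a plain list_del() the node would keep stale pointers after removal, so the second caller could not tell that the device had already been queued.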
Note that bareudp uses register_pernet_subsys() instead of _device(),
so default_device_exit_batch() guarantees that the async per-netns
works are flushed before ->exit().
Tested:
1. Create bareudp device across two netns.
# ip netns add ns1
# ip netns add ns2
# ip -n ns1 link add bareudp0 link-netns ns2 type bareudp \
dstport 9292 ethertype ipv4
2. Run bpftrace to check that bareudp_uninit() is called between
->exit_rtnl() and ->exit().
# bpftrace -e '#include <linux/netdevice.h>
kprobe:bareudp_uninit {
$dev = (struct net_device *)arg0;
printf("PID: %d | DEV: %s%s\n", pid, $dev->name, kstack());
}
kprobe:bareudp_exit_rtnl_net,
kprobe:bareudp_exit_net {
printf("PID: %d%s\n", pid, kstack());
}'
3. Remove the netns where the bareudp socket resides
# ip netns del ns2
Now, we can see bareudp0 is unregistered by per-netns work
instead of cleanup_net() and it finishes before ->exit() to
avoid WARN_ON_ONCE(!list_empty(&bn->bareudp_list)) there.
PID: 576
bareudp_exit_rtnl_net+5
ops_undo_list+702
cleanup_net+1122
process_scheduled_works+2538
...
PID: 470 | DEV: bareudp0
bareudp_uninit+5
unregister_netdevice_many_notify+7129
unregister_netdevice_many_net+1050
rtnl_net_work_func+136
process_scheduled_works+2538
...
PID: 576
bareudp_exit_net+5
ops_undo_list+1064
cleanup_net+1122
process_scheduled_works+2538
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
drivers/net/bareudp.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
index 7dedf4867e7b..c3b5ed52d877 100644
--- a/drivers/net/bareudp.c
+++ b/drivers/net/bareudp.c
@@ -701,12 +701,13 @@ static int bareudp_link_config(struct net_device *dev,
return 0;
}
-static void __bareudp_dellink(struct net_device *dev, struct list_head *head)
+static void __bareudp_dellink(struct net *net, struct net_device *dev,
+ struct list_head *head)
{
struct bareudp_dev *bareudp = netdev_priv(dev);
- list_del(&bareudp->next);
- unregister_netdevice_queue(dev, head);
+ list_del_init(&bareudp->next);
+ unregister_netdevice_queue_net(net, dev, head);
}
static void bareudp_dellink(struct net_device *dev, struct list_head *head)
@@ -717,7 +718,8 @@ static void bareudp_dellink(struct net_device *dev, struct list_head *head)
bn = net_generic(bareudp->net, bareudp_net_id);
mutex_lock(&bn->lock);
- __bareudp_dellink(dev, head);
+ if (!list_empty(&bareudp->next))
+ __bareudp_dellink(dev_net(dev), dev, head);
mutex_unlock(&bn->lock);
}
@@ -811,14 +813,22 @@ static void __net_exit bareudp_exit_rtnl_net(struct net *net,
mutex_lock(&bn->lock);
list_for_each_entry_safe(bareudp, next, &bn->bareudp_list, next)
- __bareudp_dellink(bareudp->dev, dev_kill_list);
+ __bareudp_dellink(net, bareudp->dev, dev_kill_list);
mutex_unlock(&bn->lock);
}
+static void __net_exit bareudp_exit_net(struct net *net)
+{
+ struct bareudp_net *bn = net_generic(net, bareudp_net_id);
+
+ WARN_ON_ONCE(!list_empty(&bn->bareudp_list));
+}
+
static struct pernet_operations bareudp_net_ops = {
.init = bareudp_init_net,
.exit_rtnl = bareudp_exit_rtnl_net,
+ .exit = bareudp_exit_net,
.id = &bareudp_net_id,
.size = sizeof(struct bareudp_net),
};
--
2.55.0.rc0.799.gd6f94ed593-goog
* [PATCH v1 net-next 11/14] ipvlan: Convert ipvl_port.count to refcount_t.
From: Kuniyuki Iwashima @ 2026-07-01 21:41 UTC
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260701214334.266991-1-kuniyu@google.com>
struct ipvl_port is shared between a lower device and its upper
ipvlan devices.
While each upper device can always access ipvl_port safely via
ipvlan_dev.port, the lower device relies on RTNL to access it
via net_device.rx_handler_data.
Once RTNL is removed, the lower device cannot read ipvl_port safely
in ipvlan_device_event(): when the last ipvlan device in another
namespace is unregistered, the port can be freed concurrently and
net_device.rx_handler_data is reset to NULL.
Let's convert ipvl_port.count to refcount_t and use RCU along with
refcount_inc_not_zero() in ipvlan_device_event().
netdev_put() in ipvlan_port_destroy() is also moved down after
cancel_work_sync(), which is the last user of port->dev.
Note that ipvlan->port is now set in ipvlan_init() so that it can
be used in ipvlan_uninit(), instead of ipvlan_port_get_rtnl()
(rtnl_dereference()).
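The refcount_inc_not_zero() pattern relied on above can be approximated in userspace with a C11 compare-and-swap loop. This is a sketch under assumptions: toy_inc_not_zero() and toy_port_put() are hypothetical names, the RCU read-side section is not modelled, and the saturation checks of the kernel's refcount_t are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_port {
	atomic_int count;   /* models ipvl_port.count */
	bool destroyed;     /* set where ipvlan_port_destroy() would run */
};

/* Take a reference only while at least one reference is still held. */
static bool toy_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}

	return false;       /* the object is already on its way out */
}

/* Drop a reference; the caller that drops the last one tears down. */
static void toy_port_put(struct toy_port *port)
{
	if (atomic_fetch_sub(&port->count, 1) == 1)
		port->destroyed = true;
}
```

A notifier that fails to take the reference simply treats the lower device as having no port, which matches the early return in ipvlan_device_event().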
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
drivers/net/ipvlan/ipvlan.h | 2 +-
drivers/net/ipvlan/ipvlan_main.c | 75 ++++++++++++++++++++++----------
2 files changed, 52 insertions(+), 25 deletions(-)
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 80f84fc87008..78f9107fa752 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -96,7 +96,7 @@ struct ipvl_port {
u16 dev_id_start;
struct work_struct wq;
struct sk_buff_head backlog;
- int count;
+ refcount_t count;
struct ida ida;
netdevice_tracker dev_tracker;
};
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index ed46439a9f4e..b4906a8d24ef 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -86,6 +86,7 @@ static int ipvlan_port_create(struct net_device *dev)
goto err;
netdev_hold(dev, &port->dev_tracker, GFP_KERNEL);
+
return 0;
err:
@@ -93,16 +94,18 @@ static int ipvlan_port_create(struct net_device *dev)
return err;
}
-static void ipvlan_port_destroy(struct net_device *dev)
+static void ipvlan_port_destroy(struct ipvl_port *port)
{
- struct ipvl_port *port = ipvlan_port_get_rtnl(dev);
+ struct net_device *dev = port->dev;
struct sk_buff *skb;
- netdev_put(dev, &port->dev_tracker);
if (port->mode == IPVLAN_MODE_L3S)
ipvlan_l3s_unregister(port);
+
netdev_rx_handler_unregister(dev);
cancel_work_sync(&port->wq);
+ netdev_put(dev, &port->dev_tracker);
+
while ((skb = __skb_dequeue(&port->backlog)) != NULL) {
dev_put(skb->dev);
kfree_skb(skb);
@@ -111,6 +114,27 @@ static void ipvlan_port_destroy(struct net_device *dev)
kfree(port);
}
+static void ipvlan_port_put(struct ipvl_port *port)
+{
+ if (refcount_dec_and_test(&port->count))
+ ipvlan_port_destroy(port);
+}
+
+static struct ipvl_port *ipvlan_port_get(struct net_device *dev)
+{
+ struct ipvl_port *port = NULL;
+
+ rcu_read_lock();
+ if (netif_is_ipvlan_port(dev)) {
+ port = ipvlan_port_get_rcu(dev);
+ if (!refcount_inc_not_zero(&port->count))
+ port = NULL;
+ }
+ rcu_read_unlock();
+
+ return port;
+}
+
#define IPVLAN_ALWAYS_ON_OFLOADS \
(NETIF_F_SG | NETIF_F_HW_CSUM | \
NETIF_F_GSO_ROBUST | NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ENCAP_ALL)
@@ -159,24 +183,24 @@ static int ipvlan_init(struct net_device *dev)
free_percpu(ipvlan->pcpu_stats);
return err;
}
+ port = ipvlan_port_get_rtnl(phy_dev);
+ refcount_set(&port->count, 1);
+ } else {
+ port = ipvlan_port_get_rtnl(phy_dev);
+ refcount_inc(&port->count);
}
- port = ipvlan_port_get_rtnl(phy_dev);
- port->count += 1;
+
+ ipvlan->port = port;
+
return 0;
}
static void ipvlan_uninit(struct net_device *dev)
{
struct ipvl_dev *ipvlan = netdev_priv(dev);
- struct net_device *phy_dev = ipvlan->phy_dev;
- struct ipvl_port *port;
free_percpu(ipvlan->pcpu_stats);
-
- port = ipvlan_port_get_rtnl(phy_dev);
- port->count -= 1;
- if (!port->count)
- ipvlan_port_destroy(port->dev);
+ ipvlan_port_put(ipvlan->port);
}
static int ipvlan_open(struct net_device *dev)
@@ -594,9 +618,7 @@ int ipvlan_link_new(struct net_device *dev, struct rtnl_newlink_params *params,
if (err < 0)
return err;
- /* ipvlan_init() would have created the port, if required */
- port = ipvlan_port_get_rtnl(phy_dev);
- ipvlan->port = port;
+ port = ipvlan->port;
/* If the port-id base is at the MAX value, then wrap it around and
* begin from 0x1 again. This may be due to a busy system where lots
@@ -729,14 +751,13 @@ static int ipvlan_device_event(struct notifier_block *unused,
struct netdev_notifier_pre_changeaddr_info *prechaddr_info;
struct net_device *dev = netdev_notifier_info_to_dev(ptr);
struct ipvl_dev *ipvlan, *next;
+ int err, ret = NOTIFY_DONE;
struct ipvl_port *port;
LIST_HEAD(lst_kill);
- int err;
-
- if (!netif_is_ipvlan_port(dev))
- return NOTIFY_DONE;
- port = ipvlan_port_get_rtnl(dev);
+ port = ipvlan_port_get(dev);
+ if (!port)
+ return ret;
switch (event) {
case NETDEV_UP:
@@ -788,8 +809,10 @@ static int ipvlan_device_event(struct notifier_block *unused,
err = netif_pre_changeaddr_notify(ipvlan->dev,
prechaddr_info->dev_addr,
extack);
- if (err)
- return notifier_from_errno(err);
+ if (err) {
+ ret = notifier_from_errno(err);
+ break;
+ }
}
break;
@@ -802,7 +825,8 @@ static int ipvlan_device_event(struct notifier_block *unused,
case NETDEV_PRE_TYPE_CHANGE:
/* Forbid underlying device to change its type. */
- return NOTIFY_BAD;
+ ret = NOTIFY_BAD;
+ break;
case NETDEV_NOTIFY_PEERS:
case NETDEV_BONDING_FAILOVER:
@@ -810,7 +834,10 @@ static int ipvlan_device_event(struct notifier_block *unused,
list_for_each_entry(ipvlan, &port->ipvlans, pnode)
call_netdevice_notifiers(event, ipvlan->dev);
}
- return NOTIFY_DONE;
+
+ ipvlan_port_put(port);
+
+ return ret;
}
/* the caller must held the addrs lock */
--
2.55.0.rc0.799.gd6f94ed593-goog
* [PATCH v1 net-next 09/14] bareudp: Protect bareudp_list with mutex.
From: Kuniyuki Iwashima @ 2026-07-01 21:41 UTC
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260701214334.266991-1-kuniyu@google.com>
struct bareudp_dev.net is the netns where the backend bareudp
socket resides.
struct bareudp_dev is linked to the bareudp_net.bareudp_list of
the socket's netns.
During netns dismantle or module unload, bareudp_exit_rtnl_net()
iterates the list and queues devices for destruction regardless
of the devices' netns.
Thus, once RTNL is removed, the list can be modified concurrently
from different netns due to device removal.
Let's protect it with per-netns mutex.
bareudp_newlink() is still protected by rtnl_net_lock()s, so
taking bn->lock separately in bareudp_find_dev() and
bareudp_configure(), rather than holding it across both, is not
a problem.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
drivers/net/bareudp.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)
diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
index 5ef841c85526..7dedf4867e7b 100644
--- a/drivers/net/bareudp.c
+++ b/drivers/net/bareudp.c
@@ -36,6 +36,7 @@ static unsigned int bareudp_net_id;
struct bareudp_net {
struct list_head bareudp_list;
+ struct mutex lock;
};
struct bareudp_conf {
@@ -636,10 +637,15 @@ static struct bareudp_dev *bareudp_find_dev(struct bareudp_net *bn,
{
struct bareudp_dev *bareudp, *t = NULL;
+ mutex_lock(&bn->lock);
+
list_for_each_entry(bareudp, &bn->bareudp_list, next) {
if (conf->port == bareudp->port)
t = bareudp;
}
+
+ mutex_unlock(&bn->lock);
+
return t;
}
@@ -675,7 +681,10 @@ static int bareudp_configure(struct net *net, struct net_device *dev,
if (err)
return err;
+ mutex_lock(&bn->lock);
list_add(&bareudp->next, &bn->bareudp_list);
+ mutex_unlock(&bn->lock);
+
return 0;
}
@@ -692,7 +701,7 @@ static int bareudp_link_config(struct net_device *dev,
return 0;
}
-static void bareudp_dellink(struct net_device *dev, struct list_head *head)
+static void __bareudp_dellink(struct net_device *dev, struct list_head *head)
{
struct bareudp_dev *bareudp = netdev_priv(dev);
@@ -700,6 +709,18 @@ static void bareudp_dellink(struct net_device *dev, struct list_head *head)
unregister_netdevice_queue(dev, head);
}
+static void bareudp_dellink(struct net_device *dev, struct list_head *head)
+{
+ struct bareudp_dev *bareudp = netdev_priv(dev);
+ struct bareudp_net *bn;
+
+ bn = net_generic(bareudp->net, bareudp_net_id);
+
+ mutex_lock(&bn->lock);
+ __bareudp_dellink(dev, head);
+ mutex_unlock(&bn->lock);
+}
+
static int bareudp_newlink(struct net_device *dev,
struct rtnl_newlink_params *params,
struct netlink_ext_ack *extack)
@@ -776,6 +797,8 @@ static __net_init int bareudp_init_net(struct net *net)
struct bareudp_net *bn = net_generic(net, bareudp_net_id);
INIT_LIST_HEAD(&bn->bareudp_list);
+ mutex_init(&bn->lock);
+
return 0;
}
@@ -785,8 +808,12 @@ static void __net_exit bareudp_exit_rtnl_net(struct net *net,
struct bareudp_net *bn = net_generic(net, bareudp_net_id);
struct bareudp_dev *bareudp, *next;
+ mutex_lock(&bn->lock);
+
list_for_each_entry_safe(bareudp, next, &bn->bareudp_list, next)
- bareudp_dellink(bareudp->dev, dev_kill_list);
+ __bareudp_dellink(bareudp->dev, dev_kill_list);
+
+ mutex_unlock(&bn->lock);
}
static struct pernet_operations bareudp_net_ops = {
--
2.55.0.rc0.799.gd6f94ed593-goog
* [PATCH v1 net-next 12/14] ipvlan: Synchronise ipvlan_init() and ipvlan_uninit() for the same lower dev.
From: Kuniyuki Iwashima @ 2026-07-01 21:41 UTC
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260701214334.266991-1-kuniyu@google.com>
ipvlan_uninit() for the last ipvlan device resets the lower device's
rx_handler_data to NULL.
Once RTNL is removed, ipvlan_init() would race with ipvlan_uninit(),
which could leak a newly allocated ipvl_port.
ipvlan_init()                             ipvlan_uninit()
|                                         |- if (refcount_dec_and_test(old_port))
...                                         |- ipvlan_port_destroy(old_port)
|                                             '
|- refcount_inc_not_zero(old_port) <-- fails
|- ipvlan_port_create(phy_dev)                .
   |- new_port = kzalloc()                    |
   |- phy_dev->rx_handler_data = new_port
                                              |- phy_dev->rx_handler_data = NULL
                                              ...
                                              `- kfree(old_port);
Let's synchronise the two by holding the lower device's netdev_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
drivers/net/ipvlan/ipvlan_main.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index b4906a8d24ef..7adad781e9b5 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -177,9 +177,12 @@ static int ipvlan_init(struct net_device *dev)
if (!ipvlan->pcpu_stats)
return -ENOMEM;
+ netdev_lock(phy_dev);
+
if (!netif_is_ipvlan_port(phy_dev)) {
err = ipvlan_port_create(phy_dev);
if (err < 0) {
+ netdev_unlock(phy_dev);
free_percpu(ipvlan->pcpu_stats);
return err;
}
@@ -190,6 +193,8 @@ static int ipvlan_init(struct net_device *dev)
refcount_inc(&port->count);
}
+ netdev_unlock(phy_dev);
+
ipvlan->port = port;
return 0;
@@ -198,9 +203,19 @@ static int ipvlan_init(struct net_device *dev)
static void ipvlan_uninit(struct net_device *dev)
{
struct ipvl_dev *ipvlan = netdev_priv(dev);
+ netdevice_tracker dev_tracker;
+ struct net_device *phy_dev;
free_percpu(ipvlan->pcpu_stats);
+
+ phy_dev = ipvlan->phy_dev;
+ netdev_hold(phy_dev, &dev_tracker, GFP_KERNEL);
+ netdev_lock(phy_dev);
+
ipvlan_port_put(ipvlan->port);
+
+ netdev_unlock(phy_dev);
+ netdev_put(phy_dev, &dev_tracker);
}
static int ipvlan_open(struct net_device *dev)
--
2.55.0.rc0.799.gd6f94ed593-goog
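The race in the commit message above can be modelled in userspace. This is only a minimal sketch under stated assumptions: struct port, struct lower_dev, port_get_or_create() and port_put() are hypothetical stand-ins (not kernel APIs), a pthread mutex plays the role of the lower device's netdev_lock(), and a plain counter replaces refcount_t. The point it illustrates is that the "look up or create" step and the "last put clears the pointer" step run under the same lock, so a freshly created port cannot be overwritten with NULL.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Userspace stand-ins for this sketch; none of these are kernel APIs. */
struct port {
	int refcnt;
};

struct lower_dev {
	pthread_mutex_t lock;	/* plays the role of netdev_lock(phy_dev) */
	struct port *port;	/* plays the role of rx_handler_data */
};

/* Models ipvlan_init(): reuse the lower device's port or create one. */
static struct port *port_get_or_create(struct lower_dev *dev)
{
	struct port *port;

	pthread_mutex_lock(&dev->lock);
	port = dev->port;
	if (!port) {
		port = calloc(1, sizeof(*port));
		if (!port) {
			pthread_mutex_unlock(&dev->lock);
			return NULL;
		}
		dev->port = port;
	}
	port->refcnt++;
	pthread_mutex_unlock(&dev->lock);

	return port;
}

/* Models ipvlan_uninit(): the last put clears the pointer and frees. */
static void port_put(struct lower_dev *dev, struct port *port)
{
	pthread_mutex_lock(&dev->lock);
	if (--port->refcnt == 0) {
		/* Under the lock, nobody can have installed a new port. */
		assert(dev->port == port);
		dev->port = NULL;
		free(port);
	}
	pthread_mutex_unlock(&dev->lock);
}
```

Without the shared lock, the store of NULL in port_put() could land after a concurrent port_get_or_create() had already published its new port, which is the leak the diagram shows.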
* [PATCH v1 net-next 13/14] ipvlan: Protect ipvl_port.ipvlans with mutex.
From: Kuniyuki Iwashima @ 2026-07-01 21:41 UTC (permalink / raw)
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260701214334.266991-1-kuniyu@google.com>
struct ipvl_port is shared between a lower device and its upper
ipvlan devices.
All upper devices are linked to ipvl_port.ipvlans.
Once RTNL is removed, the list can be modified concurrently from
different netns due to device removal.
Let's protect it with a per-port mutex.
NETDEV_PRECHANGEUPPER and NETDEV_CHANGEUPPER are explicitly
skipped to avoid a deadlock when netdev_upper_dev_unlink() is
called from the NETDEV_UNREGISTER handler.
Note that __ipvtap_dellink() and struct ipvtap_dev are moved to
ipvlan_main.c and ipvlan.h to support CONFIG_IPVLAN=y with CONFIG_IPVTAP=m.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
drivers/net/ipvlan/ipvlan.h | 14 ++++++++-
drivers/net/ipvlan/ipvlan_main.c | 54 +++++++++++++++++++++++++++++---
drivers/net/ipvlan/ipvtap.c | 15 +++------
3 files changed, 67 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 78f9107fa752..a0736f5c89f6 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -16,6 +16,9 @@
#include <linux/if_arp.h>
#include <linux/if_link.h>
#include <linux/if_vlan.h>
+#if IS_ENABLED(CONFIG_IPVTAP)
+#include <linux/if_tap.h>
+#endif
#include <linux/ip.h>
#include <linux/inetdevice.h>
#include <linux/netfilter.h>
@@ -91,6 +94,7 @@ struct ipvl_port {
struct hlist_head hlhead[IPVLAN_HASH_SIZE];
spinlock_t addrs_lock; /* guards hash-table and addrs */
struct list_head ipvlans;
+ struct mutex pnodes_lock;
u16 mode;
u16 flags;
u16 dev_id_start;
@@ -168,7 +172,6 @@ void ipvlan_count_rx(const struct ipvl_dev *ipvlan,
unsigned int len, bool success, bool mcast);
int ipvlan_link_new(struct net_device *dev, struct rtnl_newlink_params *params,
struct netlink_ext_ack *extack);
-void ipvlan_link_delete(struct net_device *dev, struct list_head *head);
void ipvlan_link_setup(struct net_device *dev);
int ipvlan_link_register(struct rtnl_link_ops *ops);
#ifdef CONFIG_IPVLAN_L3S
@@ -207,4 +210,13 @@ static inline bool netif_is_ipvlan_port(const struct net_device *dev)
return rcu_access_pointer(dev->rx_handler) == ipvlan_handle_frame;
}
+#if IS_ENABLED(CONFIG_IPVTAP)
+struct ipvtap_dev {
+ struct ipvl_dev vlan;
+ struct tap_dev tap;
+};
+
+void __ipvtap_dellink(struct net_device *dev, struct list_head *head);
+#endif
+
#endif /* __IPVLAN_H */
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 7adad781e9b5..41024fe27b78 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -16,6 +16,8 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval,
ASSERT_RTNL();
if (port->mode != nval) {
+ mutex_lock(&port->pnodes_lock);
+
list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
flags = ipvlan->dev->flags;
if (nval == IPVLAN_MODE_L3 || nval == IPVLAN_MODE_L3S) {
@@ -40,6 +42,8 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval,
ipvlan_l3s_unregister(port);
}
port->mode = nval;
+
+ mutex_unlock(&port->pnodes_lock);
}
return 0;
@@ -56,6 +60,8 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval,
NULL);
}
+ mutex_unlock(&port->pnodes_lock);
+
return err;
}
@@ -76,6 +82,7 @@ static int ipvlan_port_create(struct net_device *dev)
INIT_HLIST_HEAD(&port->hlhead[idx]);
spin_lock_init(&port->addrs_lock);
+ mutex_init(&port->pnodes_lock);
skb_queue_head_init(&port->backlog);
INIT_WORK(&port->wq, ipvlan_process_multicast);
ida_init(&port->ida);
@@ -676,7 +683,10 @@ int ipvlan_link_new(struct net_device *dev, struct rtnl_newlink_params *params,
if (err)
goto unlink_netdev;
+ mutex_lock(&port->pnodes_lock);
list_add_tail_rcu(&ipvlan->pnode, &port->ipvlans);
+ mutex_unlock(&port->pnodes_lock);
+
netif_stacked_transfer_operstate(phy_dev, dev);
return 0;
@@ -690,7 +700,7 @@ int ipvlan_link_new(struct net_device *dev, struct rtnl_newlink_params *params,
}
EXPORT_SYMBOL_GPL(ipvlan_link_new);
-void ipvlan_link_delete(struct net_device *dev, struct list_head *head)
+static void __ipvlan_link_delete(struct net_device *dev, struct list_head *head)
{
struct ipvl_dev *ipvlan = netdev_priv(dev);
struct ipvl_addr *addr, *next;
@@ -708,7 +718,27 @@ void ipvlan_link_delete(struct net_device *dev, struct list_head *head)
unregister_netdevice_queue(dev, head);
netdev_upper_dev_unlink(ipvlan->phy_dev, dev);
}
-EXPORT_SYMBOL_GPL(ipvlan_link_delete);
+
+static void ipvlan_link_delete(struct net_device *dev, struct list_head *head)
+{
+ struct ipvl_dev *ipvlan = netdev_priv(dev);
+
+ mutex_lock(&ipvlan->port->pnodes_lock);
+ __ipvlan_link_delete(dev, head);
+ mutex_unlock(&ipvlan->port->pnodes_lock);
+}
+
+#if IS_ENABLED(CONFIG_IPVTAP)
+void __ipvtap_dellink(struct net_device *dev, struct list_head *head)
+{
+ struct ipvtap_dev *vlantap = netdev_priv(dev);
+
+ netdev_rx_handler_unregister(dev);
+ tap_del_queues(&vlantap->tap);
+ __ipvlan_link_delete(dev, head);
+}
+EXPORT_SYMBOL_GPL(__ipvtap_dellink);
+#endif
void ipvlan_link_setup(struct net_device *dev)
{
@@ -770,10 +800,16 @@ static int ipvlan_device_event(struct notifier_block *unused,
struct ipvl_port *port;
LIST_HEAD(lst_kill);
+ if (event == NETDEV_PRECHANGEUPPER ||
+ event == NETDEV_CHANGEUPPER)
+ return ret;
+
port = ipvlan_port_get(dev);
if (!port)
return ret;
+ mutex_lock(&port->pnodes_lock);
+
switch (event) {
case NETDEV_UP:
case NETDEV_DOWN:
@@ -800,9 +836,15 @@ static int ipvlan_device_event(struct notifier_block *unused,
if (dev->reg_state != NETREG_UNREGISTERING)
break;
- list_for_each_entry_safe(ipvlan, next, &port->ipvlans, pnode)
- ipvlan->dev->rtnl_link_ops->dellink(ipvlan->dev,
- &lst_kill);
+ list_for_each_entry_safe(ipvlan, next, &port->ipvlans, pnode) {
+#if IS_ENABLED(CONFIG_IPVTAP)
+ if (ipvlan->dev->rtnl_link_ops != &ipvlan_link_ops)
+ __ipvtap_dellink(ipvlan->dev, &lst_kill);
+ else
+#endif
+ __ipvlan_link_delete(ipvlan->dev, &lst_kill);
+ }
+
unregister_netdevice_many(&lst_kill);
break;
@@ -850,6 +892,8 @@ static int ipvlan_device_event(struct notifier_block *unused,
call_netdevice_notifiers(event, ipvlan->dev);
}
+ mutex_unlock(&port->pnodes_lock);
+
ipvlan_port_put(port);
return ret;
diff --git a/drivers/net/ipvlan/ipvtap.c b/drivers/net/ipvlan/ipvtap.c
index 2d6bbddd1edd..17b0dd7cf73b 100644
--- a/drivers/net/ipvlan/ipvtap.c
+++ b/drivers/net/ipvlan/ipvtap.c
@@ -2,7 +2,6 @@
#include <linux/etherdevice.h>
#include "ipvlan.h"
#include <linux/if_vlan.h>
-#include <linux/if_tap.h>
#include <linux/interrupt.h>
#include <linux/nsproxy.h>
#include <linux/compat.h>
@@ -43,11 +42,6 @@ static struct class ipvtap_class = {
.namespace = ipvtap_net_namespace,
};
-struct ipvtap_dev {
- struct ipvl_dev vlan;
- struct tap_dev tap;
-};
-
static void ipvtap_count_tx_dropped(struct tap_dev *tap)
{
struct ipvtap_dev *vlantap = container_of(tap, struct ipvtap_dev, tap);
@@ -112,11 +106,12 @@ static int ipvtap_newlink(struct net_device *dev,
static void ipvtap_dellink(struct net_device *dev,
struct list_head *head)
{
- struct ipvtap_dev *vlan = netdev_priv(dev);
+ struct ipvtap_dev *vlantap = netdev_priv(dev);
+ struct ipvl_port *port = vlantap->vlan.port;
- netdev_rx_handler_unregister(dev);
- tap_del_queues(&vlan->tap);
- ipvlan_link_delete(dev, head);
+ mutex_lock(&port->pnodes_lock);
+ __ipvtap_dellink(dev, head);
+ mutex_unlock(&port->pnodes_lock);
}
static void ipvtap_setup(struct net_device *dev)
--
2.55.0.rc0.799.gd6f94ed593-goog
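The split between the locked entry points and their "__" helpers in the patch above follows a common pattern, sketched here in userspace. All names are hypothetical stand-ins: a pthread mutex plays the role of pnodes_lock, a small bool array plays the role of the ipvlans list, and the "_locked" suffix is used instead of a "__" prefix because that prefix is reserved in userspace C.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_UPPERS 4

struct port {
	pthread_mutex_t lock;		/* plays the role of pnodes_lock */
	bool present[MAX_UPPERS];	/* plays the role of the ipvlans list */
};

/* Caller must hold port->lock (the "__" variant in the patch). */
static void upper_delete_locked(struct port *port, size_t idx)
{
	assert(idx < MAX_UPPERS);
	port->present[idx] = false;
}

/* Single deletion, like ->dellink(): takes the lock itself. */
static void upper_delete(struct port *port, size_t idx)
{
	pthread_mutex_lock(&port->lock);
	upper_delete_locked(port, idx);
	pthread_mutex_unlock(&port->lock);
}

/*
 * Bulk removal, like the NETDEV_UNREGISTER case: the lock is already
 * held for the whole walk, so calling upper_delete() here would
 * self-deadlock on a non-recursive mutex.
 */
static size_t lower_unregister(struct port *port)
{
	size_t i, removed = 0;

	pthread_mutex_lock(&port->lock);
	for (i = 0; i < MAX_UPPERS; i++) {
		if (port->present[i]) {
			upper_delete_locked(port, i);
			removed++;
		}
	}
	pthread_mutex_unlock(&port->lock);

	return removed;
}
```

This is presumably also why the notifier in the patch calls __ipvlan_link_delete() or __ipvtap_dellink() directly instead of going through ->dellink(), which now takes the mutex.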
* [PATCH v1 net-next 14/14] ipvlan: Support per-netns netdev unregistration.
From: Kuniyuki Iwashima @ 2026-07-01 21:41 UTC (permalink / raw)
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn
Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
In-Reply-To: <20260701214334.266991-1-kuniyu@google.com>
When a lower device is unregistered, its upper ipvlan devices
must also be unregistered. However, these upper devices may
reside in different netns than the lower device.
Let's use unregister_netdevice_queue_net() to support per-netns
device unregistration for ipvlan.
The new dying flag in struct ipvl_dev avoids a race where
ipvlan_link_delete() is called while the lower device is
being removed in ipvlan_device_event().
If dying is true in ipvlan_link_delete(), the ipvlan device has
already been torn down but not yet unregistered. In this case,
unregistration will be done in __rtnl_net_unlock() of the
->dellink() caller.
Tested:
1. Create veth in ns1 and two ipvlan devices in ns2 and ns3.
# ip netns add ns1
# ip netns add ns2
# ip netns add ns3
# ip -n ns1 link add veth0 type veth peer veth1
# ip -n ns2 link add ipvl2 link veth0 link-netns ns1 type ipvlan mode l2
# ip -n ns3 link add ipvl3 link veth0 link-netns ns1 type ipvlan mode l2
2. Run bpftrace to check that veth0 is unregistered first but
waits for the ipvlan devices to be unregistered
# bpftrace -e '#include <linux/netdevice.h>
kprobe:ipvlan_uninit,
kprobe:veth_dellink,
kprobe:free_netdev {
$dev = (struct net_device *)arg0;
printf("PID: %d | DEV: %s%s\n", pid, $dev->name, kstack());
}'
3. Remove the lower veth0 in ns1.
# ip -n ns1 link del veth0
We can see that veth0 is freed after unregistering ipvl2 and ipvl3
in the per-netns work because ipvl_port holds a reference to veth0.
PID: 2010 | DEV: veth0
veth_dellink+5
rtnl_dellink+1213
rtnetlink_rcv_msg+1791
...
PID: 440 | DEV: ipvl2
ipvlan_uninit+5
unregister_netdevice_many_notify+7129
unregister_netdevice_many_net+1050
rtnl_net_work_func+136
process_scheduled_works+2538
...
PID: 440 | DEV: ipvl2
free_netdev+5
netdev_run_todo+4798
process_scheduled_works+2538
...
PID: 440 | DEV: ipvl3
ipvlan_uninit+5
unregister_netdevice_many_notify+7129
unregister_netdevice_many_net+1050
rtnl_net_work_func+136
process_scheduled_works+2538
...
PID: 2010 | DEV: veth0
free_netdev+5
netdev_run_todo+4798
rtnl_dellink+1507
rtnetlink_rcv_msg+1791
...
PID: 440 | DEV: ipvl3
free_netdev+5
netdev_run_todo+4798
process_scheduled_works+2538
...
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
drivers/net/ipvlan/ipvlan.h | 4 +++-
drivers/net/ipvlan/ipvlan_main.c | 25 ++++++++++++++++---------
drivers/net/ipvlan/ipvtap.c | 3 ++-
3 files changed, 21 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index a0736f5c89f6..a83313244add 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -72,6 +72,7 @@ struct ipvl_dev {
DECLARE_BITMAP(mac_filters, IPVLAN_MAC_FILTER_SIZE);
netdev_features_t sfeatures;
u32 msg_enable;
+ bool dying;
};
struct ipvl_addr {
@@ -216,7 +217,8 @@ struct ipvtap_dev {
struct tap_dev tap;
};
-void __ipvtap_dellink(struct net_device *dev, struct list_head *head);
+void __ipvtap_dellink(struct net *net, struct net_device *dev,
+ struct list_head *head);
#endif
#endif /* __IPVLAN_H */
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 41024fe27b78..7e2cf43ca78a 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -700,7 +700,8 @@ int ipvlan_link_new(struct net_device *dev, struct rtnl_newlink_params *params,
}
EXPORT_SYMBOL_GPL(ipvlan_link_new);
-static void __ipvlan_link_delete(struct net_device *dev, struct list_head *head)
+static void __ipvlan_link_delete(struct net *net, struct net_device *dev,
+ struct list_head *head)
{
struct ipvl_dev *ipvlan = netdev_priv(dev);
struct ipvl_addr *addr, *next;
@@ -715,7 +716,7 @@ static void __ipvlan_link_delete(struct net_device *dev, struct list_head *head)
ida_free(&ipvlan->port->ida, dev->dev_id);
list_del_rcu(&ipvlan->pnode);
- unregister_netdevice_queue(dev, head);
+ unregister_netdevice_queue_net(net, dev, head);
netdev_upper_dev_unlink(ipvlan->phy_dev, dev);
}
@@ -724,18 +725,20 @@ static void ipvlan_link_delete(struct net_device *dev, struct list_head *head)
struct ipvl_dev *ipvlan = netdev_priv(dev);
mutex_lock(&ipvlan->port->pnodes_lock);
- __ipvlan_link_delete(dev, head);
+ if (!ipvlan->dying)
+ __ipvlan_link_delete(dev_net(dev), dev, head);
mutex_unlock(&ipvlan->port->pnodes_lock);
}
#if IS_ENABLED(CONFIG_IPVTAP)
-void __ipvtap_dellink(struct net_device *dev, struct list_head *head)
+void __ipvtap_dellink(struct net *net, struct net_device *dev,
+ struct list_head *head)
{
struct ipvtap_dev *vlantap = netdev_priv(dev);
netdev_rx_handler_unregister(dev);
tap_del_queues(&vlantap->tap);
- __ipvlan_link_delete(dev, head);
+ __ipvlan_link_delete(net, dev, head);
}
EXPORT_SYMBOL_GPL(__ipvtap_dellink);
#endif
@@ -832,22 +835,26 @@ static int ipvlan_device_event(struct notifier_block *unused,
ipvlan_migrate_l3s_hook(oldnet, newnet);
break;
}
- case NETDEV_UNREGISTER:
+ case NETDEV_UNREGISTER: {
+ struct net *net = dev_net(dev);
+
if (dev->reg_state != NETREG_UNREGISTERING)
break;
list_for_each_entry_safe(ipvlan, next, &port->ipvlans, pnode) {
+ ipvlan->dying = true;
+
#if IS_ENABLED(CONFIG_IPVTAP)
if (ipvlan->dev->rtnl_link_ops != &ipvlan_link_ops)
- __ipvtap_dellink(ipvlan->dev, &lst_kill);
+ __ipvtap_dellink(net, ipvlan->dev, &lst_kill);
else
#endif
- __ipvlan_link_delete(ipvlan->dev, &lst_kill);
+ __ipvlan_link_delete(net, ipvlan->dev, &lst_kill);
}
unregister_netdevice_many(&lst_kill);
break;
-
+ }
case NETDEV_FEAT_CHANGE:
list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
netif_inherit_tso_max(ipvlan->dev, dev);
diff --git a/drivers/net/ipvlan/ipvtap.c b/drivers/net/ipvlan/ipvtap.c
index 17b0dd7cf73b..b790959c03f5 100644
--- a/drivers/net/ipvlan/ipvtap.c
+++ b/drivers/net/ipvlan/ipvtap.c
@@ -110,7 +110,8 @@ static void ipvtap_dellink(struct net_device *dev,
struct ipvl_port *port = vlantap->vlan.port;
mutex_lock(&port->pnodes_lock);
- __ipvtap_dellink(dev, head);
+ if (!vlantap->vlan.dying)
+ __ipvtap_dellink(dev_net(dev), dev, head);
mutex_unlock(&port->pnodes_lock);
}
--
2.55.0.rc0.799.gd6f94ed593-goog
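The role of the dying flag described in the commit message above can be sketched in userspace as follows. This is a simplified model, not the driver code: struct port, struct upper and the three functions are hypothetical, a pthread mutex plays the role of pnodes_lock, and the teardowns counter exists only so the behaviour can be checked.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct port {
	pthread_mutex_t lock;	/* plays the role of pnodes_lock */
};

struct upper {
	bool linked;	/* still on the port's list */
	bool dying;	/* set by the lower device's unregister path */
	int teardowns;	/* counts teardown calls, for the demo only */
};

/* Caller holds port->lock. Must run at most once per device. */
static void upper_teardown_locked(struct upper *up)
{
	assert(up->linked);
	up->linked = false;
	up->teardowns++;
}

/* Models the NETDEV_UNREGISTER case: mark, then tear down. */
static void lower_unregister(struct port *port, struct upper *up)
{
	pthread_mutex_lock(&port->lock);
	if (up->linked) {
		up->dying = true;
		upper_teardown_locked(up);
	}
	pthread_mutex_unlock(&port->lock);
}

/* Models ->dellink(): skip a device the other path already handled. */
static void upper_dellink(struct port *port, struct upper *up)
{
	pthread_mutex_lock(&port->lock);
	if (!up->dying)
		upper_teardown_locked(up);
	pthread_mutex_unlock(&port->lock);
}
```

Whichever path takes the mutex first does the teardown; the other one sees either the flag or an unlinked device and does nothing, so the teardown runs exactly once.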
* [PATCH v2 1/2] landlock: fix TCP Fast Open connection bypass
From: Matthieu Buffet @ 2026-07-01 21:46 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Bryam Vargas, Günther Noack, linux-security-module,
Mikhail Ivanov, Paul Moore, Eric Dumazet, Neal Cardwell,
linux-kernel, netdev, Matthieu Buffet
In-Reply-To: <c21cf4f3-6c21-4170-b578-13c1bfd48b87@buffet.re>
The documentation of the socket_connect() LSM hook states that it
controls connecting a socket to a remote address. It has not been the
case since the addition of TCP Fast Open (RFC 7413) support, which allows
opening a TCP connection (thus, setting a socket's destination address)
via the MSG_FASTOPEN flag passed to sendto()/sendmsg()/sendmmsg(). The
problem then got duplicated into MPTCP.
Landlock did not take it into account when its TCP support was added,
leaving a bypass of TCP connect policy.
Ideally a call to the LSM hook would be added in the fastopen code path,
in order to fix this generically. But connect() hooks are designed to run
with the socket locked, unlike sendmsg() hooks.
Closes: https://github.com/landlock-lsm/linux/issues/41
Fixes: fff69fb03dde ("landlock: Support network rules with TCP bind and connect")
Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
---
security/landlock/net.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/security/landlock/net.c b/security/landlock/net.c
index cbff59ec3aba..46c17116fcf4 100644
--- a/security/landlock/net.c
+++ b/security/landlock/net.c
@@ -351,6 +351,14 @@ static int hook_socket_sendmsg(struct socket *const sock,
access_mask_t access_request;
int ret = 0;
+ if ((msg->msg_flags & MSG_FASTOPEN) && address && sk_is_tcp(sock->sk)) {
+ ret = current_check_access_socket(
+ sock, address, addrlen, LANDLOCK_ACCESS_NET_CONNECT_TCP,
+ true);
+ if (ret != 0)
+ return ret;
+ }
+
if (sk_is_udp(sock->sk))
access_request = LANDLOCK_ACCESS_NET_CONNECT_SEND_UDP;
else
--
2.47.3
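The condition added to hook_socket_sendmsg() above can be read as a small predicate. The sketch below restates it in userspace C so the cases are easy to enumerate; sendmsg_needs_connect_check() is a hypothetical helper, the is_tcp parameter stands in for sk_is_tcp(sock->sk), and the fallback value for MSG_FASTOPEN is the Linux one, provided only for libc headers that lack it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000	/* Linux value, for older libc headers */
#endif

/* The mask test below relies on MSG_FASTOPEN being a single flag bit. */
static_assert((MSG_FASTOPEN & (MSG_FASTOPEN - 1)) == 0,
	      "MSG_FASTOPEN must be a single bit");

/*
 * Hypothetical helper mirroring the condition in the patch: a send
 * acts like connect() only when Fast Open is requested, a destination
 * address is supplied, and the socket is TCP.
 */
static bool sendmsg_needs_connect_check(int msg_flags,
					const struct sockaddr *addr,
					bool is_tcp)
{
	return (msg_flags & MSG_FASTOPEN) && addr != NULL && is_tcp;
}
```

Only the first of the four cases exercised by the selftest in patch 2/2 (TCP, address given, flag set) is expected to reach the LANDLOCK_ACCESS_NET_CONNECT_TCP check; UDP and Unix sockets fall through to the existing logic.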
* [PATCH v2 2/2] selftests/landlock: Add test for TCP fast open
From: Matthieu Buffet @ 2026-07-01 21:46 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Bryam Vargas, Günther Noack, linux-security-module,
Mikhail Ivanov, Paul Moore, Eric Dumazet, Neal Cardwell,
linux-kernel, netdev, Matthieu Buffet
In-Reply-To: <20260701214628.33319-1-matthieu@buffet.re>
Enforce that TCP Fast Open is controlled by
LANDLOCK_ACCESS_NET_CONNECT_TCP. Semantics of connect() and
sendmsg(MSG_FASTOPEN) should be identical from Landlock's perspective.
Also enforce error code consistency, since UDP sockets ignore
the MSG_FASTOPEN flag while Unix sockets reject it.
Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
---
tools/testing/selftests/landlock/net_test.c | 94 +++++++++++++++++++++
1 file changed, 94 insertions(+)
diff --git a/tools/testing/selftests/landlock/net_test.c b/tools/testing/selftests/landlock/net_test.c
index 2ed1f76b7a8b..2e4dc5025b04 100644
--- a/tools/testing/selftests/landlock/net_test.c
+++ b/tools/testing/selftests/landlock/net_test.c
@@ -1281,6 +1281,100 @@ TEST_F(protocol, connect_unspec)
EXPECT_EQ(0, close(bind_fd));
}
+TEST_F(protocol, tcp_fastopen)
+{
+ const bool restricted = variant->sandbox == TCP_SANDBOX &&
+ variant->prot.type == SOCK_STREAM &&
+ (variant->prot.protocol == IPPROTO_TCP || variant->prot.protocol == IPPROTO_IP) &&
+ (variant->prot.domain == AF_INET || variant->prot.domain == AF_INET6);
+ const struct landlock_ruleset_attr ruleset_attr = {
+ .handled_access_net = LANDLOCK_ACCESS_NET_CONNECT_TCP,
+ };
+ int bind_fd, client_fd, status;
+ char buf;
+ pid_t child;
+
+ bind_fd = socket_variant(&self->srv0);
+ ASSERT_LE(0, bind_fd);
+ EXPECT_EQ(0, bind_variant(bind_fd, &self->srv0));
+ if (self->srv0.protocol.type == SOCK_STREAM)
+ EXPECT_EQ(0, listen(bind_fd, backlog));
+
+ child = fork();
+ ASSERT_LE(0, child);
+ if (child == 0) {
+ int connect_fd, ret;
+
+ /* Closes listening socket for the child. */
+ EXPECT_EQ(0, close(bind_fd));
+
+ connect_fd = socket_variant(&self->srv0);
+ ASSERT_LE(0, connect_fd);
+
+ if (variant->sandbox == TCP_SANDBOX) {
+ const int ruleset_fd = landlock_create_ruleset(
+ &ruleset_attr, sizeof(ruleset_attr), 0);
+ ASSERT_LE(0, ruleset_fd);
+
+ enforce_ruleset(_metadata, ruleset_fd);
+ EXPECT_EQ(0, close(ruleset_fd));
+ }
+
+ /* Fast Open with no address. */
+ ret = sendto_variant(connect_fd, NULL, NULL, 0, MSG_FASTOPEN);
+ if (self->srv0.protocol.domain == AF_UNIX) {
+ EXPECT_EQ(-ENOTCONN, ret);
+ } else if (self->srv0.protocol.type == SOCK_DGRAM) {
+ EXPECT_EQ(-EDESTADDRREQ, ret);
+ } else {
+ EXPECT_EQ(-EINVAL, ret);
+ }
+
+ /* Fast Open to a denied address. */
+ ret = sendto_variant(connect_fd, &self->srv0, "A", 1, MSG_FASTOPEN);
+ if (restricted) {
+ EXPECT_EQ(-EACCES, ret);
+ } else if (self->srv0.protocol.domain == AF_UNIX &&
+ self->srv0.protocol.type == SOCK_STREAM) {
+ EXPECT_EQ(-EOPNOTSUPP, ret);
+ } else {
+ EXPECT_EQ(0, ret);
+ }
+
+ EXPECT_EQ(0, close(connect_fd));
+ _exit(_metadata->exit_code);
+ return;
+ }
+
+ client_fd = bind_fd;
+ if (!restricted && self->srv0.protocol.type == SOCK_STREAM &&
+ self->srv0.protocol.domain != AF_UNIX) {
+ client_fd = accept(bind_fd, NULL, 0);
+ ASSERT_LE(0, client_fd);
+ }
+
+ if (restricted) {
+ EXPECT_EQ(-1, read(client_fd, &buf, 1));
+ EXPECT_EQ(ENOTCONN, errno);
+ } else if (self->srv0.protocol.domain == AF_UNIX &&
+ self->srv0.protocol.type == SOCK_STREAM) {
+ EXPECT_EQ(-1, read(client_fd, &buf, 1));
+ EXPECT_EQ(EINVAL, errno);
+ } else {
+ EXPECT_EQ(1, read(client_fd, &buf, 1));
+ EXPECT_EQ('A', buf);
+ }
+
+ EXPECT_EQ(child, waitpid(child, &status, 0));
+ EXPECT_EQ(1, WIFEXITED(status));
+ EXPECT_EQ(EXIT_SUCCESS, WEXITSTATUS(status));
+
+ if (client_fd != bind_fd)
+ EXPECT_LE(0, close(client_fd));
+
+ EXPECT_EQ(0, close(bind_fd));
+}
+
TEST_F(protocol, sendmsg_stream)
{
int srv0_fd, tmp_fd, client_fd, res;
--
2.47.3
* Re: [PATCH net-next v3 2/3] ptp: Add driver for R-Car Gen4
From: Vadim Fedorenko @ 2026-07-01 21:47 UTC (permalink / raw)
To: Niklas Söderlund, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Geert Uytterhoeven, Magnus Damm, Richard Cochran,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, linux-renesas-soc, devicetree, linux-kernel, netdev
In-Reply-To: <20260701090607.1108208-3-niklas.soderlund+renesas@ragnatech.se>
On 01/07/2026 10:06, Niklas Söderlund wrote:
> Add driver for the gPTP timer found on R-Car Gen4 devices. The timer is
> system-wide and shared by different Ethernet devices on each Gen4
> platform. The operation of the timer is however not completely
> independent of the system's Ethernet devices.
>
> - On R-Car S4 it is gated by the RSWITCH Ethernet module clock.
>
> - On R-Car V4H it is gated by the RTSN Ethernet module clock.
>
> - On R-Car V4M it is gated by its own module clock; the system has
> neither an RTSN nor an RSWITCH device. But the module clock is the same as
> RTSN on V4H and the documentation refers to it as tsn (EtherTSN).
>
> The gPTP device does have its own register space on all three platforms.
> But on S4 and V4H it will share its clock and reset property with
> RSWITCH or RTSN, respectively.
>
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
[...]
> +static int ptp_rcar_gen4_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
> +{
> + struct ptp_rcar_gen4_priv *priv = ptp_to_priv(ptp);
> + s64 addend = priv->default_addend;
> + bool neg_adj = scaled_ppm < 0;
> + unsigned long flags;
> + s64 diff;
> +
> + if (neg_adj)
> + scaled_ppm = -scaled_ppm;
> + diff = div_s64(addend * scaled_ppm_to_ppb(scaled_ppm), NSEC_PER_SEC);
> + addend = neg_adj ? addend - diff : addend + diff;
> +
> + spin_lock_irqsave(&priv->lock, flags);
> + iowrite32(addend, priv->base + PTPTIVC0_REG);
How are you so sure that addend will always fit into s32? It looks like
it may overflow in some cases, no?
> + spin_unlock_irqrestore(&priv->lock, flags);
> +
> + return 0;
> +}
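To make the range question concrete, here is a minimal userspace sketch of the same arithmetic. The helper names are hypothetical, ppb is taken directly instead of going through scaled_ppm_to_ppb(), and the example values in the note below are made up for illustration since the driver's real default_addend and max_adj are not shown in the quoted hunk. The check is done against an unsigned 32-bit range on the assumption that the register takes the raw 32-bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Same arithmetic as the quoted adjfine: addend +/- addend * |ppb| / 1e9. */
static int64_t adjusted_addend(int64_t default_addend, int64_t ppb)
{
	int64_t abs_ppb = ppb < 0 ? -ppb : ppb;
	int64_t diff;

	assert(default_addend >= 0);
	diff = default_addend * abs_ppb / NSEC_PER_SEC;

	return ppb < 0 ? default_addend - diff : default_addend + diff;
}

/* True if the value survives a 32-bit register write unchanged. */
static bool fits_u32(int64_t v)
{
	return v >= 0 && v <= (int64_t)UINT32_MAX;
}
```

With a hypothetical default of 4000000000 and a +10% adjustment (ppb = 100000000), adjusted_addend() yields 4400000000, which no longer fits in 32 bits, so whether the write is safe depends on the default addend and on the max_adj the driver advertises.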
* Re: [PATCH net v2] mac802154: remove interfaces with RCU list deletion
From: Kuniyuki Iwashima @ 2026-07-01 21:49 UTC (permalink / raw)
To: Yousef Alhouseen
Cc: alex.aring, stefan, miquel.raynal, davem, edumazet, kuba, pabeni,
horms, marcel, linux-wpan, netdev, linux-kernel, stable,
syzbot+36256deb69a588e9290e
In-Reply-To: <20260701164222.9094-1-alhouseenyousef@gmail.com>
On Wed, Jul 1, 2026 at 9:42 AM Yousef Alhouseen
<alhouseenyousef@gmail.com> wrote:
>
> Queue wake, stop, and disable paths walk local->interfaces under RCU.
> The bulk hardware teardown path removes entries with list_del(), so an
> asynchronous transmit completion can follow a poisoned list node in
> ieee802154_wake_queue().
>
> Use list_del_rcu() as in the single-interface removal path. The following
> unregister_netdevice() waits for in-flight RCU readers before freeing the
> netdevice, so no separate grace-period wait is needed.
>
> Fixes: 592dfbfc72f5 ("mac820154: move interface unregistration into iface")
> Reported-by: syzbot+36256deb69a588e9290e@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=36256deb69a588e9290e
> Cc: stable@vger.kernel.org
> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
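For readers less familiar with the difference, the following userspace sketch shows why a reader that is standing on a removed node survives one kind of deletion and not the other. It is a toy list, not the kernel's <linux/list.h>: struct node, del_plain(), del_rcu_style() and the POISON sentinel are all hypothetical stand-ins, and no real RCU grace period is modelled.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next, *prev;
};

/* Arbitrary sentinel standing in for the kernel's LIST_POISON values. */
#define POISON ((struct node *)0x100)

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void unlink_node(struct node *n)
{
	assert(n->next != POISON);	/* catches a double delete */
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* list_del() style: a reader still standing on n follows a poisoned next. */
static void del_plain(struct node *n)
{
	unlink_node(n);
	n->next = POISON;
	n->prev = POISON;
}

/* list_del_rcu() style: next stays valid so such a reader can move on. */
static void del_rcu_style(struct node *n)
{
	unlink_node(n);
	n->prev = POISON;
}
```

In both cases new traversals from the head skip the node; the difference is only what a traversal that had already reached the node sees when it reads next.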
* Re: [PATCH net-next v2 1/4] net: add sockopt_init_user() for getsockopt conversion
From: Willem de Bruijn @ 2026-07-01 21:50 UTC (permalink / raw)
To: Breno Leitao, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Willem de Bruijn, Shuah Khan,
sdf.kernel
Cc: netdev, linux-kernel, linux-kselftest, Breno Leitao, kernel-team
In-Reply-To: <20260630-getsockopt_phase2-v2-1-193335f3d4d1@debian.org>
Breno Leitao wrote:
> Add a helper that initializes a user-backed sockopt_t from the (optval,
> optlen) __user pair passed to a getsockopt() callback.
>
> It is used by transitional __user getsockopt wrappers while the
> proto-layer getsockopt callbacks are converted to take a sockopt_t, and
> is removed once the conversion is complete.
>
> The goal is to help to convert leafs. Example:
>
> sock_common_getsockopt(... char __user *optval, int __user *optlen)
> → udp_getsockopt(sk, level, optname, optval__user, optlen__user)
> → udp_lib_getsockopt(sk, level, optname, &opt) /* needs a sockopt_t */
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Willem de Bruijn <willemb@google.com>
* Re: [PATCH net-next v2 2/4] udp: convert udp_lib_getsockopt to sockopt_t
From: Willem de Bruijn @ 2026-07-01 21:51 UTC (permalink / raw)
To: Breno Leitao, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Willem de Bruijn, Shuah Khan,
sdf.kernel
Cc: netdev, linux-kernel, linux-kselftest, Breno Leitao, kernel-team
In-Reply-To: <20260630-getsockopt_phase2-v2-2-193335f3d4d1@debian.org>
Breno Leitao wrote:
> In preparation for converting the proto-layer getsockopt callbacks to the
> sockopt_t interface, switch udp_lib_getsockopt() to take a sockopt_t.
>
> The thin udp_getsockopt()/udpv6_getsockopt() wrappers keep their __user
> signature for now: they build a user-backed sockopt_t with
> sockopt_init_user(), call the helper, and write the returned length back
> to optlen. The helper uses copy_to_iter() instead of copy_to_user().
> No functional change.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Willem de Bruijn <willemb@google.com>
* Re: [RFC PATCH 1/2] landlock: fix TCP Fast Open connection bypass
From: Matthieu Buffet @ 2026-07-01 21:42 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Bryam Vargas, Günther Noack, linux-security-module,
Mikhail Ivanov, Paul Moore, Eric Dumazet, Neal Cardwell,
linux-kernel, netdev
In-Reply-To: <20260626.taijohThood1@digikod.net>
Hi Mickaël,
On 6/26/2026 10:40 PM, Mickaël Salaün wrote:
> Thanks Matthieu, could you please rebase this series on the master
> branch (especially on top of your UDP changes)?
Yes, I had hoped to send it just before the UDP merge, but no luck.
Here is the patch with your feedback, rebased on next on top of the UDP changes.
Have a nice day!
--
Matthieu
* Re: [PATCH net-next v2 3/4] ipv4: raw: convert do_raw_getsockopt to sockopt_t
From: Willem de Bruijn @ 2026-07-01 21:52 UTC (permalink / raw)
To: Breno Leitao, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Willem de Bruijn, Shuah Khan,
sdf.kernel
Cc: netdev, linux-kernel, linux-kselftest, Breno Leitao, kernel-team
In-Reply-To: <20260630-getsockopt_phase2-v2-3-193335f3d4d1@debian.org>
Breno Leitao wrote:
> Continue converting the proto-layer getsockopt callbacks to the sockopt_t
> interface, switching do_raw_getsockopt() and its raw_geticmpfilter()
> helper to take a sockopt_t.
>
> The thin raw_getsockopt() wrapper keeps its __user signature for now: it
> builds a user-backed sockopt_t with sockopt_init_user(), calls the helper,
> and writes the returned length back to optlen. The helper uses
> copy_to_iter() instead of copy_to_user(). No functional change.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Willem de Bruijn <willemb@google.com>
* Re: [PATCH net-next v2 4/4] selftests: net: getsockopt_iter: add raw ICMP_FILTER coverage
From: Willem de Bruijn @ 2026-07-01 21:55 UTC (permalink / raw)
To: Breno Leitao, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Willem de Bruijn, Shuah Khan,
sdf.kernel
Cc: netdev, linux-kernel, linux-kselftest, Breno Leitao, kernel-team
In-Reply-To: <20260630-getsockopt_phase2-v2-4-193335f3d4d1@debian.org>
Breno Leitao wrote:
> Exercise the raw getsockopt path now backed by sockopt_t. ICMP_FILTER
> returns a fixed-size struct and, unlike the int/u64 options already
> covered, clamps the length down to the user buffer on a short read
> instead of failing, so check that semantic explicitly along with the
> exact and oversized cases, the -EOPNOTSUPP path on a non-ICMP raw
> socket, and an unknown optname.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
> +TEST_F(raw, icmpfilter_oversize_clamped)
> +{
> + char buf[16] = {};
Not a reason to respin, but instead of a raw constant, something like
sizeof(struct icmp_filter) + 1 is more robust and descriptive.
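A minimal sketch of that suggestion, written as standalone C so it can be compiled outside the selftest. struct icmp_filter_model is a stand-in assumed to match the layout of struct icmp_filter from <linux/icmp.h>, and clamp_optlen() is a hypothetical helper that only models the clamping behaviour the test checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in assumed to match struct icmp_filter in <linux/icmp.h>. */
struct icmp_filter_model {
	uint32_t data;
};

/* One byte larger than the object, whatever size the object has. */
static char oversize_buf[sizeof(struct icmp_filter_model) + 1];

static_assert(sizeof(oversize_buf) > sizeof(struct icmp_filter_model),
	      "buffer must be strictly larger than the filter");

/* Models the clamp: the returned length never exceeds the object size. */
static size_t clamp_optlen(size_t user_len)
{
	size_t max = sizeof(struct icmp_filter_model);

	return user_len < max ? user_len : max;
}
```

Sizing the buffer from the struct keeps the test an "oversized" case even if the struct ever changes, which a literal 16 only does by coincidence.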
* RE: [PATCH net-next v6 06/15] net: ethernet: oa_tc6: Support for hardware timestamp
From: Selvamani Rajagopal @ 2026-07-01 21:59 UTC (permalink / raw)
To: Jerry.Ray@microchip.com, andrew@lunn.ch, Piergiorgio Beruto,
hkallweit1@gmail.com, linux@armlinux.org.uk, davem@davemloft.net,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
andrew+netdev@lunn.ch, Parthiban.Veerasooran@microchip.com,
richardcochran@gmail.com, robh@kernel.org, krzk+dt@kernel.org,
conor+dt@kernel.org, horms@kernel.org, corbet@lwn.net,
skhan@linuxfoundation.org
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
devicetree@vger.kernel.org, linux-doc@vger.kernel.org
In-Reply-To: <CH3PR11MB77231F85DFC870B5B43AFDE1EFF62@CH3PR11MB7723.namprd11.prod.outlook.com>
> -----Original Message-----
> From: Jerry.Ray@microchip.com <Jerry.Ray@microchip.com>
> Subject: RE: [PATCH net-next v6 06/15] net: ethernet: oa_tc6: Support for hardware
> timestamp
>
> The receive path here unconditionally consumes a 64-bit (OA_TC6_TSTAMP_SZ = 8
> byte) frame timestamp: it is gated only on the footer RTSA bit, always copies
> two 32-bit words, and pulls sizeof(ts) from the skb. The RX buffer is likewise
> sized with a fixed + OA_TC6_TSTAMP_SZ. Nothing consults the configured timestamp
> width.
>
> But oa_tc6_set_hwtstamp_settings() only sets CONFIG0.FTSE:
>
I do remember our conversation on this subject.
For this comment and the comment at the bottom, I believe the feedback is to set
the CONFIG0_FTSS_64BIT bit in the OA TC6 framework. Will do.
> > +}
> > +/* Tx timestamp capture register A (high) */
> > +#define OA_TC6_REG_TTSCA_HIGH (0x1010)
> > +
>
> Please fix the value of OA_TC6_REG_TTSCA_HIGH to 0x0010 in patch 6 where it is
> introduced rather than correcting it in patch 12.
I didn't realize that. Let me check and fix it. Thanks.
>
> > /* Control command header */
> >
> > + cfg0 &= ~CONFIG0_FTSE_ENABLE;
>
> It never sets the 64-bit frame-timestamp-select bit (CONFIG0 bit 6). So the
> framework enables timestamping at a 32-bit width while the
> receive path strips 8 bytes.
>
> This happens to work for the S2500 only because the S2500 driver forces 64-bit
> independently in its own SPI config (patch 12/15):
>
> Please review my feedback against your v3 patch series on 5-Jun.
> (CONFIG0_FTSE_ENABLE | CONFIG0_FTSS_64BIT)
>
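A rough sketch of what setting both bits together could look like, written as plain C rather than the driver's register accessors. FTSS at bit 6 comes from the review text above; FTSE at bit 7 is an assumption taken from the OPEN Alliance TC6 CONFIG0 layout, and config0_apply_hwtstamp() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* FTSS bit 6 is stated in the review; FTSE bit 7 is an assumption. */
#define CONFIG0_FTSE_ENABLE	(UINT32_C(1) << 7)
#define CONFIG0_FTSS_64BIT	(UINT32_C(1) << 6)

/* Hypothetical helper: compute the CONFIG0 value for a hwtstamp request. */
static uint32_t config0_apply_hwtstamp(uint32_t cfg0, bool enable)
{
	if (enable)
		cfg0 |= CONFIG0_FTSE_ENABLE | CONFIG0_FTSS_64BIT;
	else
		cfg0 &= ~CONFIG0_FTSE_ENABLE;

	/* Invariant the review asks for: enabled implies 64-bit width. */
	assert(!(cfg0 & CONFIG0_FTSE_ENABLE) || (cfg0 & CONFIG0_FTSS_64BIT));
	return cfg0;
}
```

With the width selected in the framework, the 8-byte strip in the receive path matches what the hardware is configured to append, independent of any per-driver SPI configuration.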
* Re: [PATCH bpf-next v4 1/2] bpf, sockmap: disallow update and delete from tc, xdp, socket_filter and flow_dissector
From: Emil Tsalapatis @ 2026-07-01 22:02 UTC (permalink / raw)
To: Sechang Lim, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, John Fastabend,
David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
Shuah Khan
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
Emil Tsalapatis, Stanislav Fomichev, Jiayuan Chen, Varun R Mallya,
Ihor Solodrai, bpf, netdev, linux-kernel, linux-kselftest
In-Reply-To: <20260630145410.3648099-2-rhkrqnwk98@gmail.com>
On Tue Jun 30, 2026 at 10:54 AM EDT, Sechang Lim wrote:
> sock_map_update_common() and __sock_map_delete() hold stab->lock and call
> sock_map_unref() -> sock_map_del_link(), which takes sk_callback_lock for
> write. That gives the order stab->lock -> sk_callback_lock.
>
> The reverse order comes from the SK_SKB stream parser.
> sk_psock_strp_data_ready() holds sk_callback_lock for read, and after the
> verdict tcp_bpf_strp_read_sock() acks the consumed data inline via
> __tcp_cleanup_rbuf(). The ACK goes out egress, where a sched_cls program
> deletes from the sockmap and takes stab->lock:
>
> WARNING: possible circular locking dependency detected
> ------------------------------------------------------
> syz.9.8824 is trying to acquire lock:
> (&stab->lock){+.-.}-{3:3}, at: __sock_map_delete net/core/sock_map.c:421
> but task is already holding lock:
> (clock-AF_INET){++.-}-{3:3}, at: sk_psock_strp_data_ready net/core/skmsg.c:1173
>
> -> #1 (clock-AF_INET){++.-}-{3:3}:
> _raw_write_lock_bh
> sock_map_del_link net/core/sock_map.c:167
> sock_map_unref net/core/sock_map.c:184
> sock_map_update_common net/core/sock_map.c:509
> sock_map_update_elem_sys net/core/sock_map.c:588
> map_update_elem kernel/bpf/syscall.c:1805
>
> -> #0 (&stab->lock){+.-.}-{3:3}:
> _raw_spin_lock_bh
> __sock_map_delete net/core/sock_map.c:421
> sock_map_delete_elem net/core/sock_map.c:452
> bpf_prog_06044d24140080b6
> tcx_run net/core/dev.c:4451
> sch_handle_egress net/core/dev.c:4541
> __dev_queue_xmit net/core/dev.c:4808
> ...
> tcp_bpf_strp_read_sock net/ipv4/tcp_bpf.c:701
> strp_data_ready net/strparser/strparser.c:402
> sk_psock_strp_data_ready net/core/skmsg.c:1174
> tcp_data_queue net/ipv4/tcp_input.c:5661
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> rlock(clock-AF_INET);
> lock(&stab->lock);
> lock(clock-AF_INET);
> lock(&stab->lock);
>
> *** DEADLOCK ***
>
> A tc, xdp, socket_filter or flow_dissector program has no reason to
> update or delete a sockmap, and redirect does not go through here. Drop
> them from may_update_sockmap() so the verifier rejects it. It also
> closes the matching sockhash inversion.
>
> Suggested-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
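For readers less familiar with this kind of report, the inversion described above can be modelled as a cycle in a "held before" graph. The following is only a toy sketch, not how lockdep is implemented: it ignores read versus write acquisition and IRQ context, and every name in it (lock_graph, record_order, the enum values) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes, named after the two locks in the report. */
enum lock_class { STAB_LOCK, SK_CALLBACK_LOCK, NR_LOCK_CLASSES };

/* held_before[a][b] is true once some path took b while holding a. */
struct lock_graph {
	bool held_before[NR_LOCK_CLASSES][NR_LOCK_CLASSES];
};

/* Depth-first search: is 'to' reachable from 'from' via recorded edges? */
static bool reaches(const struct lock_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCK_CLASSES; next++)
		if (g->held_before[from][next] && !seen[next] &&
		    reaches(g, next, to, seen))
			return true;
	return false;
}

/* Record "inner taken while holding outer"; false if that closes a cycle. */
static bool record_order(struct lock_graph *g, int outer, int inner)
{
	bool seen[NR_LOCK_CLASSES] = { false };

	assert(outer != inner);
	if (reaches(g, inner, outer, seen))
		return false;
	g->held_before[outer][inner] = true;
	return true;
}
```

Recording stab->lock before sk_callback_lock (the syscall update path) succeeds; recording the reverse order (the strparser path reaching a tc program) is then rejected as a cycle, which is the situation the patch removes by having the verifier reject the mutation from those program types.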
> ---
> kernel/bpf/verifier.c | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 25aea4271cd0..83ea3b33ff67 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8488,12 +8488,7 @@ static bool may_update_sockmap(struct bpf_verifier_env *env, int func_id)
> if (func_id == BPF_FUNC_map_delete_elem)
> return true;
> break;
> - case BPF_PROG_TYPE_SOCKET_FILTER:
> - case BPF_PROG_TYPE_SCHED_CLS:
> - case BPF_PROG_TYPE_SCHED_ACT:
> - case BPF_PROG_TYPE_XDP:
> case BPF_PROG_TYPE_SK_REUSEPORT:
> - case BPF_PROG_TYPE_FLOW_DISSECTOR:
> case BPF_PROG_TYPE_SK_LOOKUP:
> return true;
> default:
* Re: Ethtool : PRBS feature
From: Andrew Lunn @ 2026-07-01 22:02 UTC (permalink / raw)
To: Srinivasan, Vijay
Cc: Das, Shubham, Alexander Duyck, Lee Trager, Maxime Chevallier,
netdev@vger.kernel.org, mkubecek@suse.cz, D H, Siddaraju,
Chintalapalle, Balaji, Lindberg, Magnus,
niklas.damberg@ericsson.com, Wirandi, Jonas
In-Reply-To: <BL3PR11MB6385517EECD04C783E9DC42488F62@BL3PR11MB6385.namprd11.prod.outlook.com>
On Wed, Jul 01, 2026 at 09:38:08PM +0000, Srinivasan, Vijay wrote:
> Hi Andrew,
> I think there is a disconnect here.
Which proves my point. The specification is not sufficient if you have
to keep correcting me.
The kAPI should be understandable by somebody who has a general
networking background. Please write a specification with that
assumption in mind. Don't assume the reader is a test engineer who has
used PRBS for half his life. Assume it is a brand new test engineer
who is hearing PRBS for the first time. That is what most engineers on
the netdev list are. Me included.
Andrew
* Re: [PATCH bpf-next v4 2/2] selftests/bpf: drop tc/xdp/flow_dissector/socket_filter sockmap mutation tests
From: Emil Tsalapatis @ 2026-07-01 22:04 UTC (permalink / raw)
To: Sechang Lim, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, John Fastabend,
David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
Shuah Khan
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
Emil Tsalapatis, Stanislav Fomichev, Jiayuan Chen, Varun R Mallya,
Ihor Solodrai, bpf, netdev, linux-kernel, linux-kselftest
In-Reply-To: <20260630145410.3648099-3-rhkrqnwk98@gmail.com>
On Tue Jun 30, 2026 at 10:54 AM EDT, Sechang Lim wrote:
> tc, xdp, socket_filter and flow_dissector programs can no longer update
> or delete a sockmap. Adjust the tests:
>
> - verifier_sockmap_mutate: the tc, xdp, socket_filter and
> flow_dissector cases now expect __failure with "cannot update sockmap
> in this context".
> - sockmap_basic: drop "sockmap update" / "sockhash update", which load
> a SEC("tc") program that copies a sock between maps.
> - fexit_bpf2bpf: drop "func_sockmap_update", whose freplace program
> updates a sockmap in the tc cls_redirect context.
>
> Remove the now-unused test_sockmap_update.c and freplace_cls_redirect.c.
>
> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> .../selftests/bpf/prog_tests/fexit_bpf2bpf.c | 13 -----
> .../selftests/bpf/prog_tests/sockmap_basic.c | 52 -------------------
> .../bpf/progs/freplace_cls_redirect.c | 34 ------------
> .../selftests/bpf/progs/test_sockmap_update.c | 48 -----------------
> .../bpf/progs/verifier_sockmap_mutate.c | 12 ++---
> 5 files changed, 6 insertions(+), 153 deletions(-)
> delete mode 100644 tools/testing/selftests/bpf/progs/freplace_cls_redirect.c
> delete mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_update.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c b/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c
> index 92c20803ea76..d3a954158c33 100644
> --- a/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c
> +++ b/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c
> @@ -336,17 +336,6 @@ static void test_fmod_ret_freplace(void)
> }
>
>
> -static void test_func_sockmap_update(void)
> -{
> - const char *prog_name[] = {
> - "freplace/cls_redirect",
> - };
> - test_fexit_bpf2bpf_common("./freplace_cls_redirect.bpf.o",
> - "./test_cls_redirect.bpf.o",
> - ARRAY_SIZE(prog_name),
> - prog_name, false, NULL);
> -}
> -
> static void test_func_replace_void(void)
> {
> const char *prog_name[] = {
> @@ -599,8 +588,6 @@ void serial_test_fexit_bpf2bpf(void)
> test_func_replace();
> if (test__start_subtest("func_replace_verify"))
> test_func_replace_verify();
> - if (test__start_subtest("func_sockmap_update"))
> - test_func_sockmap_update();
> if (test__start_subtest("func_replace_return_code"))
> test_func_replace_return_code();
> if (test__start_subtest("func_map_prog_compatibility"))
> diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
> index cb3229711f93..33f788e2786d 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
> @@ -7,7 +7,6 @@
>
> #include "test_progs.h"
> #include "test_skmsg_load_helpers.skel.h"
> -#include "test_sockmap_update.skel.h"
> #include "test_sockmap_invalid_update.skel.h"
> #include "test_sockmap_skb_verdict_attach.skel.h"
> #include "test_sockmap_progs_query.skel.h"
> @@ -235,53 +234,6 @@ static void test_skmsg_helpers_with_link(enum bpf_map_type map_type)
> test_skmsg_load_helpers__destroy(skel);
> }
>
> -static void test_sockmap_update(enum bpf_map_type map_type)
> -{
> - int err, prog, src;
> - struct test_sockmap_update *skel;
> - struct bpf_map *dst_map;
> - const __u32 zero = 0;
> - char dummy[14] = {0};
> - LIBBPF_OPTS(bpf_test_run_opts, topts,
> - .data_in = dummy,
> - .data_size_in = sizeof(dummy),
> - .repeat = 1,
> - );
> - __s64 sk;
> -
> - sk = connected_socket_v4();
> - if (!ASSERT_NEQ(sk, -1, "connected_socket_v4"))
> - return;
> -
> - skel = test_sockmap_update__open_and_load();
> - if (!ASSERT_OK_PTR(skel, "open_and_load"))
> - goto close_sk;
> -
> - prog = bpf_program__fd(skel->progs.copy_sock_map);
> - src = bpf_map__fd(skel->maps.src);
> - if (map_type == BPF_MAP_TYPE_SOCKMAP)
> - dst_map = skel->maps.dst_sock_map;
> - else
> - dst_map = skel->maps.dst_sock_hash;
> -
> - err = bpf_map_update_elem(src, &zero, &sk, BPF_NOEXIST);
> - if (!ASSERT_OK(err, "update_elem(src)"))
> - goto out;
> -
> - err = bpf_prog_test_run_opts(prog, &topts);
> - if (!ASSERT_OK(err, "test_run"))
> - goto out;
> - if (!ASSERT_NEQ(topts.retval, 0, "test_run retval"))
> - goto out;
> -
> - compare_cookies(skel->maps.src, dst_map);
> -
> -out:
> - test_sockmap_update__destroy(skel);
> -close_sk:
> - close(sk);
> -}
> -
> static void test_sockmap_invalid_update(void)
> {
> struct test_sockmap_invalid_update *skel;
> @@ -1385,10 +1337,6 @@ void test_sockmap_basic(void)
> test_skmsg_helpers(BPF_MAP_TYPE_SOCKMAP);
> if (test__start_subtest("sockhash sk_msg load helpers"))
> test_skmsg_helpers(BPF_MAP_TYPE_SOCKHASH);
> - if (test__start_subtest("sockmap update"))
> - test_sockmap_update(BPF_MAP_TYPE_SOCKMAP);
> - if (test__start_subtest("sockhash update"))
> - test_sockmap_update(BPF_MAP_TYPE_SOCKHASH);
> if (test__start_subtest("sockmap update in unsafe context"))
> test_sockmap_invalid_update();
> if (test__start_subtest("sockmap copy"))
> diff --git a/tools/testing/selftests/bpf/progs/freplace_cls_redirect.c b/tools/testing/selftests/bpf/progs/freplace_cls_redirect.c
> deleted file mode 100644
> index 7e94412d47a5..000000000000
> --- a/tools/testing/selftests/bpf/progs/freplace_cls_redirect.c
> +++ /dev/null
> @@ -1,34 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0
> -// Copyright (c) 2020 Facebook
> -
> -#include <linux/stddef.h>
> -#include <linux/bpf.h>
> -#include <linux/pkt_cls.h>
> -#include <bpf/bpf_endian.h>
> -#include <bpf/bpf_helpers.h>
> -
> -struct {
> - __uint(type, BPF_MAP_TYPE_SOCKMAP);
> - __type(key, int);
> - __type(value, int);
> - __uint(max_entries, 2);
> -} sock_map SEC(".maps");
> -
> -SEC("freplace/cls_redirect")
> -int freplace_cls_redirect_test(struct __sk_buff *skb)
> -{
> - int ret = 0;
> - const int zero = 0;
> - struct bpf_sock *sk;
> -
> - sk = bpf_map_lookup_elem(&sock_map, &zero);
> - if (!sk)
> - return TC_ACT_SHOT;
> -
> - ret = bpf_map_update_elem(&sock_map, &zero, sk, 0);
> - bpf_sk_release(sk);
> -
> - return ret == 0 ? TC_ACT_OK : TC_ACT_SHOT;
> -}
> -
> -char _license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_update.c b/tools/testing/selftests/bpf/progs/test_sockmap_update.c
> deleted file mode 100644
> index 6d64ea536e3d..000000000000
> --- a/tools/testing/selftests/bpf/progs/test_sockmap_update.c
> +++ /dev/null
> @@ -1,48 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0
> -// Copyright (c) 2020 Cloudflare
> -#include "vmlinux.h"
> -#include <bpf/bpf_helpers.h>
> -
> -struct {
> - __uint(type, BPF_MAP_TYPE_SOCKMAP);
> - __uint(max_entries, 1);
> - __type(key, __u32);
> - __type(value, __u64);
> -} src SEC(".maps");
> -
> -struct {
> - __uint(type, BPF_MAP_TYPE_SOCKMAP);
> - __uint(max_entries, 1);
> - __type(key, __u32);
> - __type(value, __u64);
> -} dst_sock_map SEC(".maps");
> -
> -struct {
> - __uint(type, BPF_MAP_TYPE_SOCKHASH);
> - __uint(max_entries, 1);
> - __type(key, __u32);
> - __type(value, __u64);
> -} dst_sock_hash SEC(".maps");
> -
> -SEC("tc")
> -int copy_sock_map(void *ctx)
> -{
> - struct bpf_sock *sk;
> - bool failed = false;
> - __u32 key = 0;
> -
> - sk = bpf_map_lookup_elem(&src, &key);
> - if (!sk)
> - return SK_DROP;
> -
> - if (bpf_map_update_elem(&dst_sock_map, &key, sk, 0))
> - failed = true;
> -
> - if (bpf_map_update_elem(&dst_sock_hash, &key, sk, 0))
> - failed = true;
> -
> - bpf_sk_release(sk);
> - return failed ? SK_DROP : SK_PASS;
> -}
> -
> -char _license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/progs/verifier_sockmap_mutate.c b/tools/testing/selftests/bpf/progs/verifier_sockmap_mutate.c
> index fe4b123187b8..20332a731d4e 100644
> --- a/tools/testing/selftests/bpf/progs/verifier_sockmap_mutate.c
> +++ b/tools/testing/selftests/bpf/progs/verifier_sockmap_mutate.c
> @@ -74,7 +74,7 @@ static __always_inline void test_sockmap_lookup_and_mutate(void)
> }
>
> SEC("action")
> -__success
> +__failure __msg("cannot update sockmap in this context")
> int test_sched_act(struct __sk_buff *skb)
> {
> test_sockmap_mutate(skb->sk);
> @@ -82,7 +82,7 @@ int test_sched_act(struct __sk_buff *skb)
> }
>
> SEC("classifier")
> -__success
> +__failure __msg("cannot update sockmap in this context")
> int test_sched_cls(struct __sk_buff *skb)
> {
> test_sockmap_mutate(skb->sk);
> @@ -90,7 +90,7 @@ int test_sched_cls(struct __sk_buff *skb)
> }
>
> SEC("flow_dissector")
> -__success
> +__failure __msg("cannot update sockmap in this context")
> int test_flow_dissector_delete(struct __sk_buff *skb __always_unused)
> {
> test_sockmap_delete();
> @@ -98,7 +98,7 @@ int test_flow_dissector_delete(struct __sk_buff *skb __always_unused)
> }
>
> SEC("flow_dissector")
> -__failure __msg("program of this type cannot use helper bpf_sk_release")
> +__failure __msg("cannot update sockmap in this context")
> int test_flow_dissector_update(struct __sk_buff *skb __always_unused)
> {
> test_sockmap_lookup_and_update(); /* no access to skb->sk */
> @@ -146,7 +146,7 @@ int test_sk_reuseport(struct sk_reuseport_md *ctx)
> }
>
> SEC("socket")
> -__success
> +__failure __msg("cannot update sockmap in this context")
> int test_socket_filter(struct __sk_buff *skb)
> {
> test_sockmap_mutate(skb->sk);
> @@ -179,7 +179,7 @@ int test_sockops_update_dedicated(struct bpf_sock_ops *ctx)
> }
>
> SEC("xdp")
> -__success
> +__failure __msg("cannot update sockmap in this context")
> int test_xdp(struct xdp_md *ctx __always_unused)
> {
> test_sockmap_lookup_and_mutate();
* Re: [PATCH net-next 1/2] macvlan: annotate data-races around vlan->mode and vlan->flags
From: Kuniyuki Iwashima @ 2026-07-01 22:11 UTC (permalink / raw)
To: Nikolay Aleksandrov
Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, netdev, eric.dumazet
In-Reply-To: <cbdc27c1-0fa9-4e4a-96ee-82309b58414c@blackwall.org>
On Wed, Jul 1, 2026 at 5:09 AM Nikolay Aleksandrov <razor@blackwall.org> wrote:
>
> On 01/07/2026 11:22, Eric Dumazet wrote:
> > Both fields can be changed in macvlan_changelink() while being read
> > locklessly.
> >
> > Add READ_ONCE()/WRITE_ONCE() annotations.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > drivers/net/macvlan.c | 38 +++++++++++++++++++++-----------------
> > 1 file changed, 21 insertions(+), 17 deletions(-)
> >
>
> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
* Re: [PATCH net-next 2/2] macvlan: no longer rely on RTNL in macvlan_fill_info()
From: Kuniyuki Iwashima @ 2026-07-01 22:14 UTC (permalink / raw)
To: Nikolay Aleksandrov
Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
Simon Horman, netdev, eric.dumazet
In-Reply-To: <3f70661e-539a-4157-9e5f-1ee94a720519@blackwall.org>
On Wed, Jul 1, 2026 at 5:12 AM Nikolay Aleksandrov <razor@blackwall.org> wrote:
>
> On 01/07/2026 11:22, Eric Dumazet wrote:
> > Add READ_ONCE()/WRITE_ONCE() annotations on vlan->mode, vlan->flags,
> > vlan->bc_queue_len_req and port->bc_cutoff.
> >
> > Fill IFLA_MACVLAN_MACADDR_DATA nested attribute and compute
> > on the fly the precise number of elements we put in it,
> > to fill an accurate IFLA_MACVLAN_MACADDR_COUNT attribute
> > as some user space applications could depend on its value
> > and the attributes order.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > drivers/net/macvlan.c | 71 +++++++++++++++++++++++++++++--------------
> > 1 file changed, 48 insertions(+), 23 deletions(-)
> >
>
> The snippet that sets macaddr_count gave me pause. :)
> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
same here :p
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
>
> [snip]
> > +
> > + if (READ_ONCE(vlan->macaddr_count) > 0) {
> > nest = nla_nest_start_noflag(skb, IFLA_MACVLAN_MACADDR_DATA);
> > if (nest == NULL)
> > goto nla_put_failure;
> >
> > for (i = 0; i < MACVLAN_HASH_SIZE; i++) {
> > - if (macvlan_fill_info_macaddr(skb, vlan, i))
> > + cnt = macvlan_fill_info_macaddr(skb, vlan, i);
> > + if (cnt < 0)
> > goto nla_put_failure;
> > + macaddr_count += cnt;
> > }
> > - nla_nest_end(skb, nest);
> > + if (!macaddr_count)
> > + nla_nest_cancel(skb, nest);
> > + else if (nla_nest_end_safe(skb, nest) < 0)
> > + goto nla_put_failure;
> > }
> > - if (nla_put_u32(skb, IFLA_MACVLAN_BC_QUEUE_LEN, vlan->bc_queue_len_req))
> > + *(u32 *)nla_data(attr) = macaddr_count;
> > +
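For anyone else who paused on that snippet, the pattern appears to be "reserve the count first, emit the entries, then backfill the count". Below is a self-contained sketch of that idea on a flat byte buffer. It is not netlink code: struct msg, msg_put() and fill_macaddrs() are hypothetical stand-ins, and the valid[] array merely simulates entries disappearing while the dump runs locklessly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical flat message buffer standing in for an skb. */
struct msg {
	uint8_t data[256];
	size_t len;
};

/* Append raw bytes; returns the offset written at, or -1 if out of room. */
static int msg_put(struct msg *m, const void *src, size_t n)
{
	size_t off = m->len;

	if (n > sizeof(m->data) - off)
		return -1;
	memcpy(m->data + off, src, n);
	m->len += n;
	return (int)off;
}

/* Write a placeholder count, emit the live entries, then backfill it. */
static int fill_macaddrs(struct msg *m, const uint8_t (*macs)[MAC_LEN],
			 const uint8_t *valid, size_t n)
{
	uint32_t count = 0;
	int count_off = msg_put(m, &count, sizeof(count));

	if (count_off < 0)
		return -1;
	for (size_t i = 0; i < n; i++) {
		if (!valid[i])
			continue;	/* entry went away: do not count it */
		if (msg_put(m, macs[i], MAC_LEN) < 0)
			return -1;
		count++;
	}
	/* The count reflects exactly what was emitted, and still precedes it. */
	memcpy(m->data + count_off, &count, sizeof(count));
	return (int)count;
}
```

The count keeps its position ahead of the entries, so a reader that depends on attribute order sees the same layout, while the value always matches the number of entries actually written.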
* [PATCH net-next v2 00/15] ibmveth: Add multi-queue RX support
From: Mingming Cao @ 2026-07-01 22:23 UTC (permalink / raw)
To: netdev
Cc: horms, bjking1, haren, ricklind, mmc, kuba, edumazet, pabeni,
linuxppc-dev, maddy, mpe
Hi,
Power11 PHYP firmware adds Virtual Ethernet multi-queue (MQ) RX for
the ibmveth device: multiple logical-LAN RX queues, per-queue buffer
posting, and completion delivery. Guest Linux did not use that
platform support; ibmveth still registered one RX queue even when
PHYP was MQ-capable.
This series adds the ibmveth MQ client. When PHYP advertises the
capability through H_ILLAN_ATTRIBUTES, the driver registers
multiple RX queues, receives on per-queue NAPI, and exposes queue
count through ethtool. Older firmware without the bit is unchanged.
Please apply to net-next.
Background
ibmveth today registers one logical LAN, one set of buffer pools, and
one NAPI context. PHYP MQ mode gives each RX queue its own handle:
buffers are posted with H_ADD_LOGICAL_LAN_BUFFERS_QUEUE, subordinate
queues register through H_REG_LOGICAL_LAN_QUEUE, and traffic can
land on any active queue. Queue selection is firmware-defined; v1
does not program RSS or hash tables. The driver needs per-queue
pools, IRQs, and poll state to match.
Queue-aware hcalls are selected only when probe sets multi_queue
from H_ILLAN_ATTRIBUTES; legacy firmware keeps the original hcall
path unchanged through the entire series.
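As a rough illustration of that gating, and not the driver's actual code, the probe-time decision can be sketched as below. The attribute bit position and every name here (HYPOTHETICAL_ILLAN_MQ_SUPPORT, struct adapter_cfg, probe_cfg) are placeholders; the real H_ILLAN_ATTRIBUTES layout is defined by the patches and PHYP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit: the real attribute layout is not reproduced here. */
#define HYPOTHETICAL_ILLAN_MQ_SUPPORT	(UINT64_C(1) << 0)

/* Hypothetical subset of per-adapter state decided at probe time. */
struct adapter_cfg {
	bool multi_queue;
	unsigned int num_rx_queues;
};

/* Pick the hcall path and RX queue count from the firmware attributes. */
static struct adapter_cfg probe_cfg(uint64_t illan_attrs, unsigned int wanted)
{
	struct adapter_cfg cfg = { .multi_queue = false, .num_rx_queues = 1 };

	assert(wanted >= 1);
	if (illan_attrs & HYPOTHETICAL_ILLAN_MQ_SUPPORT) {
		cfg.multi_queue = true;		/* queue-aware hcalls */
		cfg.num_rx_queues = wanted;
	}
	/* Otherwise: legacy hcalls, single RX queue, unchanged behaviour. */
	return cfg;
}
```

Keeping the decision in one place means every later path only has to test multi_queue to choose between the legacy and queue-aware hcalls.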
This splits the work so review follows the actual bring-up sequence:
1. Hypercall definitions and MQ data structures (patches 1-2)
2. Refactor open/close into helpers - RX, per-queue pools,
IRQ, TX, PHYP (3-9)
3. Turn on the MQ datapath at probe/open (10)
4. Per-queue RX/TX stats and sysfs pool readout (11-12)
5. Runtime RX queue resize via ethtool -L (13-14)
6. LPAR stability fix (15)
- Helper patches (3-9) reshape ibmveth_open()/close() into
queue-aware helpers. Runtime behaviour is unchanged through that
block: num_rx_queues stays 1 and multi_queue is false until patch 10.
- Patch 10 is the switch: probe sets multi_queue from firmware, raises
num_rx_queues, registers subordinates, and replenishes every active
queue.
- Patch 15 fixes poll hangs after aggressive ethtool -L cycling and
NAPI/close deadlocks on ip link down.
Testing
Tested on ppc64le PowerVM LPAR with MQ-capable firmware:
* Aggressive ethtool -L cycling (16/1/8/11/1/3/16/8/1) with ping
* MQ path: ethtool -L under iperf3 load, link down/up during traffic
* Legacy firmware (no MQ bit): full open/close/stress on the
refactored helper path to confirm single-queue behaviour is
unchanged
Changes in v2
Resubmission of v1 as 15 patches (Patchwork limit): same code and LPAR testing;
the split was squashed, with checkpatch fixes in patch 15 only.
v1: https://lore.kernel.org/r/cover.1782758799.git.mmc@linux.ibm.com
Patchwork: https://patchwork.kernel.org/project/netdevbpf/list/?series=1119106
Future work
* IRQ affinity hints for subordinate queue IRQs returned by PHYP
* Summed global no_buffer drop counter across all RX queues in MQ mode
Comments and suggestions on patch split, design, and testing are
welcome.
Mingming Cao <mmc@linux.ibm.com>
Mingming Cao (15):
ibmveth: Refactor RX resource allocation for MQ RX bring-up
ibmveth: Refactor buffer pool management for per-queue MQ RX
ibmveth: Refactor RX interrupt control for MQ RX queues
ibmveth: Refactor TX resource allocation in open/close paths
ibmveth: Add RX queue register/deregister helpers for MQ
ibmveth: Refactor open/close into MQ-ready resource pipeline
ibmveth: Add queue-aware RX buffer submit helper for MQ
ibmveth: Enable multi-queue RX receive path
ibmveth: Add per-queue RX statistics collection and reporting
ibmveth: Add per-queue TX statistics reporting
ibmveth: Expose per-queue buffer pool details via sysfs
ibmveth: Add helpers for incremental MQ RX queue resize
ibmveth: Implement incremental MQ RX queue resize
ibmveth: Wire ethtool set_channels to MQ RX queue resize
ibmveth: Fix MQ RX poll and shutdown hangs after queue resize
drivers/net/ethernet/ibm/ibmveth.c | 2350 +++++++++++++++++++++++-----
drivers/net/ethernet/ibm/ibmveth.h | 25 +-
2 files changed, 2014 insertions(+), 361 deletions(-)