* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andrei Vagin @ 2026-06-24 17:52 UTC
To: Askar Safin
Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
rostedt, torvalds, val, viro, willy
In-Reply-To: <20260624071226.2272209-1-safinaskar@gmail.com>
On Wed, Jun 24, 2026 at 12:12 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Andrei Vagin <avagin@gmail.com>:
> > The CRIU fifo test fails with this change. The problem is that vmsplice
> > with SPLICE_F_NONBLOCK to a fifo file descriptor fails with -EOPNOTSUPP.
> >
> > It seems we need a fix like this one:
> >
> > diff --git a/fs/pipe.c b/fs/pipe.c
> > index 429b0714ec57..6fc49e933727 100644
> > --- a/fs/pipe.c
> > +++ b/fs/pipe.c
> > @@ -1253,6 +1253,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
> >
> > /* We can only do regular read/write on fifos */
> > stream_open(inode, filp);
> > + filp->f_mode |= FMODE_NOWAIT;
> >
> > switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
> > case FMODE_READ:
>
> Does CRIU actually rely on ability to do SPLICE_F_NONBLOCK vmsplice into
> named fifos? Or this is merely a test?
Yes, it does.
>
> If this is just a test, I think we need not to preserve this behavior.
>
> I did debian code search with regex "vmsplice.*SPLICE_F_NONBLOCK" and I
> found very few packages. And it seems all them use pipes, not named fifos.
In short, this isn't how such cases are handled in the kernel. The fix is
simple and should be applied to avoid breaking random software.
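For readers who want to reproduce the case under discussion, here is a minimal user-space sketch of a non-blocking vmsplice() into a named FIFO. It is not CRIU code: the helper name fifo_vmsplice_nonblock() and the path handling are assumptions made for illustration. On a kernel without the series the call is expected to return the number of bytes spliced; per the report above, with the series applied and without FMODE_NOWAIT on the FIFO it would fail with EOPNOTSUPP.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): create a FIFO at @path and try a
 * non-blocking vmsplice() of @len bytes from @buf into its write end.
 * Returns the number of bytes spliced, or -errno on failure.
 */
static long fifo_vmsplice_nonblock(const char *path, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	int rfd, wfd;
	long ret;

	if (mkfifo(path, 0600) < 0)
		return -errno;

	/* Open the read end first, non-blocking, so the write open cannot block. */
	rfd = open(path, O_RDONLY | O_NONBLOCK);
	if (rfd < 0) {
		ret = -errno;
		unlink(path);
		return ret;
	}

	wfd = open(path, O_WRONLY);
	if (wfd < 0) {
		ret = -errno;
		close(rfd);
		unlink(path);
		return ret;
	}

	/* The call under discussion: SPLICE_F_NONBLOCK on a FIFO descriptor. */
	ret = vmsplice(wfd, &iov, 1, SPLICE_F_NONBLOCK);
	if (ret < 0)
		ret = -errno;

	close(wfd);
	close(rfd);
	unlink(path);
	return ret;
}
```

Calling it with a small buffer and a scratch path, for example fifo_vmsplice_nonblock("/tmp/test.fifo", buf, 5), and comparing the result against -EOPNOTSUPP is enough to tell the two behaviors apart.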
>
> (On speed: I still think that my vmsplice patches are good thing,
> despite performance regressions in CRIU.)
I already explained that this isn't just a performance degradation, it
actually breaks the pre-dump mechanism in CRIU. vmsplice is invoked from
our parasite code within the context of a user process, where execution
speed is critical. A heavy performance penalty completely invalidates
the pre-dump logic, making the feature useless.
Under normal circumstances, patches that cause this kind of breakage
would never be merged. However, since there are exceptions to every
rule, we should let the maintainers decide how to proceed here. In CRIU,
we have a backup plan to utilize process_vm_readv to dump process
memory. We already support this mode, but it isn't the default due to
performance concerns. If these patches are merged, it will be the
only option left for CRIU to implement pre-dumping.
However, we need to look at this case in a broader context. This is yet
another example where the change introduces a workflow breakage, meaning
there might be other workloads out there that could be broken by this
change.
At a minimum, we may need to consider a deprecation plan where vmsplice
with SPLICE_F_GIFT triggers a warning for a few releases before these
changes are applied. Alternatively, we could introduce the proposed
behavior alongside a sysctl to fall back to the old behavior and explicitly
state that this fallback path will be completely deprecated in a future kernel
version.
Thanks,
Andrei
* Re: [PATCH v2 bpf-next 1/2] bpf: Support BPF_F_EGRESS with bpf_redirect_peer
From: Daniel Borkmann @ 2026-06-24 17:53 UTC
To: Jordan Rife, bpf
Cc: netdev, Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
Stanislav Fomichev, Jiayuan Chen, Paul Chaignon
In-Reply-To: <20260618182035.43811-2-jordan@jrife.io>
On 6/18/26 8:20 PM, Jordan Rife wrote:
> We have several use cases where a pod injects traffic into the datapath
> of another so that the traffic appears to have originated from that
> pod. One such use case is a synthetic flow generator which injects
> synthetic traffic into a pod's datapath to enable dynamic probing and
> debugging. Another is a transparent proxy where connections originating
> from one pod are redirected towards another which proxies that
> connection. The new connection is bound to the IP of the original pod
> using IP_TRANSPARENT and its traffic is injected into that pod's
> datapath and handled as if it had originated there. This can be used for
> mTLS, etc.
>
> We use bpf_redirect(BPF_F_INGRESS) to direct traffic leaving the proxy,
> flow generator, etc. towards the target pod, ensuring that eBPF programs
> that are meant to intercept traffic leaving that pod are executed.
> However, this doesn't work with netkit.
>
> With netkit, an ingress redirection from proxy to workload skips eBPF
> programs that are meant to intercept traffic leaving the pod, since they
> reside on the netkit peer device. One workaround is to attach the
> same program to both the netkit peer device and the TCX ingress hook for
> the netkit pair's primary interface, but
>
> a) This seems hacky and we need to be careful not to run the same
> program twice for the same skb in cases where we want to pass that
> traffic to the host stack.
> b) We're trying to keep the proxy redirection / traffic injection
> systems as modular and separated from Cilium as possible, the system
> that manages netkit setup and core eBPF programming.
>
> It would be handy if instead we could redirect traffic directly from
> one netkit peer device to another. This patch proposes an extension
> to bpf_redirect_peer to allow us to do just that.
>
> With this patch, the BPF_F_EGRESS flag tells bpf_redirect_peer to emit
> the skb in the egress direction of the target interface's peer device.
> While the main use case is netkit, I suppose you could also use this
> mode with veth as well if, e.g., there were some eBPF programs attached
> to that side of the veth pair that needed to intercept traffic.
>
> +---------------------------------------------------------------------+
> | +-------------------------+ 6. bpf_redirect_neigh(eth0) |
> | | pod (10.244.0.10) | ------------------------ |
> | | | | | |
> | | +--------+ | | +---------+ | |
> | | 1. packet -->| | | | | | | |
> | | leaves ^ | netkit |<===========|======| netkit | | |
> | | | | peer |=======(eBPF)=====>| primary | | |
> | | | | | | | | | | |
> | | | +--------+ | | +---------+ | |
> | | | | | 2. bpf_redirect v |
> | +-----------|-------------+ |___________________ +-------|
> | | | | eth0 |
> | | 5. bpf_redirect_peer(BPF_F_EGRESS) | +-------|
> | |________________________ | |
> | +-------------------------+ | | |
> | | proxy (10.244.0.11) | | | |
> | | IP_TRANSPARENT | | | |
> | | +--------+ | | +---------+ | |
> | | 3. packet <--| | | | | |<-- |
> | | enters | netkit |<===========|======| netkit | |
> | | [proxy] | peer |=======(eBPF)=====>| primary | |
> | | 4. packet -->| | | | | |
> | | leaves +--------+ | +---------+ |
> | | sip=10.244.0.10 | |
> | +-------------------------+ |
> +---------------------------------------------------------------------+
>
> Using the proxy use case as an example, in step 5 we would redirect
> traffic leaving the proxy towards the pod's peer device using
> bpf_redirect_peer(BPF_F_EGRESS).
>
> As a bonus, since the skb doesn't have to go through the backlog queue
> it can take full advantage of netkit's performance benefits. I set up a
> test where outgoing iperf3 traffic is injected into the datapath of
> another pod using either bpf_redirect_peer(BPF_F_EGRESS) or
> bpf_redirect(BPF_F_INGRESS). I used Cilium's eBPF host routing mode
> which skips the host stack and uses BPF redirect helpers to do all the
> routing.
>
> (net.ipv4.tcp_congestion_control=cubic,mtu=1500,100GiB link,Cilium
> eBPF host routing mode)
>
> BASELINE [bpf_redirect(BPF_F_INGRESS)]
> 1. [iperf pod] ==bpf_redirect([pod b], BPF_F_INGRESS)==> [pod b]
> 2. [pod b] ==bpf_redirect_neigh([eth0])==> eth0
> 3. eth0 ==over network==> [host b]
>
> [ ID] Interval Transfer Bitrate Retr
> [ 5] 0.00-60.00 sec 231 GBytes 33.0 Gbits/sec 12060 sender
> [ 5] 0.00-60.00 sec 230 GBytes 33.0 Gbits/sec receiver
>
> TEST [bpf_redirect_peer(BPF_F_EGRESS)]
> 1. [iperf pod] ==bpf_redirect_peer([pod b], BPF_F_EGRESS)==> [pod b]
> 2. [pod b] ==bpf_redirect_neigh([eth0])==> eth0
> 3. eth0 ==over network==> [host b]
>
> [ ID] Interval Transfer Bitrate Retr
> [ 5] 0.00-60.00 sec 272 GBytes 38.9 Gbits/sec 0 sender
> [ 5] 0.00-60.00 sec 272 GBytes 38.9 Gbits/sec receiver
>
> In this test, using bpf_redirect_peer(BPF_F_EGRESS) for the hop from
> [iperf pod] to [pod b] led to ~18% more throughput compared to
> bpf_redirect(BPF_F_INGRESS).
>
> Signed-off-by: Jordan Rife <jordan@jrife.io>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
* Re: [PATCH v2 bpf-next 2/2] selftests/bpf: Add tests for bpf_redirect_peer with BPF_F_EGRESS
From: Daniel Borkmann @ 2026-06-24 17:54 UTC
To: Jordan Rife, bpf
Cc: netdev, Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau,
Stanislav Fomichev, Jiayuan Chen, Paul Chaignon
In-Reply-To: <20260618182035.43811-3-jordan@jrife.io>
On 6/18/26 8:20 PM, Jordan Rife wrote:
> Extend redirect tests to cover bpf_redirect_peer(BPF_F_EGRESS). SRC
> redirects to DST using bpf_redirect_peer(BPF_F_EGRESS) then traffic is
> hairpinned into DST using bpf_redirect.
>
> Signed-off-by: Jordan Rife <jordan@jrife.io>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
* Re: [PATCH net] dt-bindings: net: renesas,ether: Drop example "ethernet-phy-ieee802.3-c22" fallback
From: Niklas Söderlund @ 2026-06-24 18:07 UTC
To: Rob Herring (Arm)
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Krzysztof Kozlowski, Conor Dooley,
Geert Uytterhoeven, Magnus Damm, Sergei Shtylyov, netdev,
linux-renesas-soc, devicetree, linux-kernel
In-Reply-To: <20260624150250.131966-2-robh@kernel.org>
Hi Rob,
Thanks for your patch.
On 2026-06-24 10:02:50 -0500, Rob Herring (Arm) wrote:
> Fix the Micrel PHY in the example which shouldn't have the
> fallback "ethernet-phy-ieee802.3-c22" compatible:
>
> Documentation/devicetree/bindings/net/renesas,ether.example.dtb: ethernet-phy@1 \
> (ethernet-phy-id0022.1537): compatible: ['ethernet-phy-id0022.1537', 'ethernet-phy-ieee802.3-c22'] is too long
> from schema $id: http://devicetree.org/schemas/net/micrel.yaml
>
> Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Acked-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
> ---
> Documentation/devicetree/bindings/net/renesas,ether.yaml | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/net/renesas,ether.yaml b/Documentation/devicetree/bindings/net/renesas,ether.yaml
> index f0a52f47f95a..dd7187f12a67 100644
> --- a/Documentation/devicetree/bindings/net/renesas,ether.yaml
> +++ b/Documentation/devicetree/bindings/net/renesas,ether.yaml
> @@ -121,8 +121,7 @@ examples:
> #size-cells = <0>;
>
> phy1: ethernet-phy@1 {
> - compatible = "ethernet-phy-id0022.1537",
> - "ethernet-phy-ieee802.3-c22";
> + compatible = "ethernet-phy-id0022.1537";
> reg = <1>;
> interrupt-parent = <&irqc0>;
> interrupts = <0 IRQ_TYPE_LEVEL_LOW>;
> --
> 2.53.0
>
--
Kind Regards,
Niklas Söderlund
* Re: [PATCH] net: sparx5: unregister blocking notifier on init failure
From: Simon Horman @ 2026-06-24 18:16 UTC
To: Haoxiang Li
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, Steen.Hegelund,
daniel.machon, UNGLinuxDriver, kees, bjarni.jonasson,
lars.povlsen, netdev, linux-arm-kernel, linux-kernel, stable
In-Reply-To: <20260623115714.2192074-1-haoxiang_li2024@163.com>
On Tue, Jun 23, 2026 at 07:57:14PM +0800, Haoxiang Li wrote:
> sparx5_register_notifier_blocks() registers the switchdev blocking
> notifier before allocating the ordered workqueue. If the workqueue
> allocation fails, the error path unregisters the switchdev and netdevice
> notifiers, but leaves the blocking notifier registered.
>
> Add a separate error label for the workqueue allocation failure path and
> unregister the switchdev blocking notifier there.
>
> Fixes: d6fce5141929 ("net: sparx5: add switching support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Reviewed-by: Simon Horman <horms@kernel.org>
* [PATCH net 0/4] net: avoid nested UP notifier events
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
aleksandr.loktionov, dtatulea, Jakub Kicinski
syzbot reported that recent ethtool rework leads to deadlock
on stacked devices. VLANs create nested notifications, confusing
execution context. Bringing up dummy causes vlan to bring itself
up as well. Which in turn causes bond to ask for link state -
a call chain traveling in the opposite direction.
  bond                 (3) bond_update_speed_duplex(vlan)
   |                    ^           v
  vlan     (2) UP(vlan)            (4) vlan_ethtool_get_link_ksettings()
   |                    ^           v
  dummy    (1) UP(dummy)           (5) __ethtool_get_link_ksettings()
We locked the instance lock of dummy at (1) and will
try to lock it again at (5) - which of course deadlocks.
For non-nested notifications this is avoided because NETDEV_UP
is always run ops-locked (so that bond asks for link using the
netif_ API which assumes instance lock already held). The nesting,
however, makes this problematic, we cannot carry the state of
the whole chain back in the opposite direction.
AFAICT vlan is the only driver which causes such issues.
So let's try a localized fix of deferring vlan auto-open
to a workqueue.
Jakub Kicinski (4):
net: turn the rx_mode work into a generic netdev_work facility
net: add the driver-facing netdev_work scheduling API
vlan: defer real device state propagation to netdev_work
selftests: bonding: add a test for VLAN propagation over a bonded real
device
Documentation/networking/netdevices.rst | 2 +
net/core/Makefile | 2 +-
.../selftests/drivers/net/bonding/Makefile | 1 +
include/linux/netdevice.h | 21 +-
net/8021q/vlan.h | 11 ++
net/core/dev.h | 11 +-
net/8021q/vlan.c | 76 +-------
net/8021q/vlan_dev.c | 60 ++++++
net/core/dev.c | 2 +
net/core/dev_addr_lists.c | 77 +-------
net/core/netdev_work.c | 162 ++++++++++++++++
.../drivers/net/bonding/bond_vlan_real_dev.sh | 180 ++++++++++++++++++
12 files changed, 457 insertions(+), 148 deletions(-)
create mode 100644 net/core/netdev_work.c
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
--
2.54.0
* [PATCH net 1/4] net: turn the rx_mode work into a generic netdev_work facility
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
aleksandr.loktionov, dtatulea, Jakub Kicinski
In-Reply-To: <20260624182018.2445732-1-kuba@kernel.org>
The rx_mode update runs from a workqueue: drivers have their
ndo_set_rx_mode_async() callback executed by a single global
work item under RTNL and ops lock. This is a useful pattern.
Support multiple "events" that need to be serviced and make RX_MODE
sync the first one. Call the events "core" because later on
we will let drivers define and schedule their own.
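The pending-event idea described above can be modeled in user space as a rough sketch. This is not the kernel code: struct model_dev, model_sched() and model_cancel() are hypothetical stand-ins for the net_device fields and the __netdev_work_core_sched() / __netdev_work_core_cancel() helpers in the diff below, and the model is single-threaded, so the spinlock and the workqueue itself are left out.

```c
#include <stdbool.h>

/* Illustrative event bits; only RX_MODE exists in the patch itself. */
enum model_work {
	MODEL_WORK_RX_MODE = 1 << 0,
	MODEL_WORK_OTHER   = 1 << 1,	/* assumed second event, for the demo */
};

struct model_dev {
	bool queued;		/* stands in for !list_empty(&dev->work_node) */
	int refs;		/* stands in for the reference held via work_tracker */
	unsigned long pending;	/* stands in for dev->work_core_pending */
};

/* Mark @event pending; queue the device and take a reference on first use. */
static void model_sched(struct model_dev *dev, unsigned long event)
{
	if (!dev->queued) {
		dev->queued = true;
		dev->refs++;
	}
	dev->pending |= event;
}

/*
 * Clear @mask from the pending set. If nothing is left pending the device is
 * dequeued and the reference dropped. Returns the subset of @mask that was
 * actually pending, so the caller can run those events inline.
 */
static unsigned long model_cancel(struct model_dev *dev, unsigned long mask)
{
	unsigned long event = dev->pending & mask;

	dev->pending &= ~mask;
	if (dev->queued && !dev->pending) {
		dev->queued = false;
		dev->refs--;
	}
	return event;
}
```

In this model, scheduling two events queues the device once and takes a single reference, and the device stays queued until the last pending bit is canceled. That mirrors how netif_rx_mode_sync() runs the rx_mode update inline only when the cancel call reports the event was pending.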
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
net/core/Makefile | 2 +-
include/linux/netdevice.h | 10 ++--
net/core/dev.h | 11 +++-
net/core/dev.c | 1 +
net/core/dev_addr_lists.c | 77 +------------------------
net/core/netdev_work.c | 117 ++++++++++++++++++++++++++++++++++++++
6 files changed, 138 insertions(+), 80 deletions(-)
create mode 100644 net/core/netdev_work.c
diff --git a/net/core/Makefile b/net/core/Makefile
index dc17c5a61e9a..b3fdcb4e355f 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -13,7 +13,7 @@ obj-y += dev.o dev_api.o dev_addr_lists.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
fib_notifier.o xdp.o flow_offload.o gro.o \
- netdev-genl.o netdev-genl-gen.o gso.o
+ netdev-genl.o netdev-genl-gen.o netdev_work.o gso.o
obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b67a12541eac..732506787db3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1930,8 +1930,9 @@ enum netdev_reg_state {
* has been enabled due to the need to listen to
* additional unicast addresses in a device that
* does not implement ndo_set_rx_mode()
- * @rx_mode_node: List entry for rx_mode work processing
- * @rx_mode_tracker: Refcount tracker for rx_mode work
+ * @work_node: List entry for async netdev_work processing
+ * @work_tracker: Refcount tracker for async netdev_work
+ * @work_core_pending: Core-defined pending netdev_work (NETDEV_WORK_*)
* @rx_mode_addr_cache: Recycled snapshot entries for rx_mode work
* @rx_mode_retry_timer: Timer that re-queues rx_mode work after failure
* @rx_mode_retry_count: Number of consecutive retries already scheduled
@@ -2326,8 +2327,9 @@ struct net_device {
unsigned int promiscuity;
unsigned int allmulti;
bool uc_promisc;
- struct list_head rx_mode_node;
- netdevice_tracker rx_mode_tracker;
+ struct list_head work_node;
+ netdevice_tracker work_tracker;
+ unsigned long work_core_pending;
struct netdev_hw_addr_list rx_mode_addr_cache;
struct timer_list rx_mode_retry_timer;
unsigned int rx_mode_retry_count;
diff --git a/net/core/dev.h b/net/core/dev.h
index 4121c50e7c88..5d0b0305d3ba 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -167,10 +167,19 @@ int dev_change_carrier(struct net_device *dev, bool new_carrier);
void __dev_set_rx_mode(struct net_device *dev);
int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify);
void netif_rx_mode_init(struct net_device *dev);
-bool netif_rx_mode_clean(struct net_device *dev);
+void netif_rx_mode_run(struct net_device *dev);
void netif_rx_mode_sync(struct net_device *dev);
void netif_rx_mode_cancel_retry(struct net_device *dev);
+/* Events for the async netdev work, tracked in netdev->work_core_pending. */
+enum netdev_work_core {
+ NETDEV_WORK_RX_MODE = BIT(0), /* run the rx_mode update */
+};
+
+void __netdev_work_core_sched(struct net_device *dev, unsigned long event);
+unsigned long
+__netdev_work_core_cancel(struct net_device *dev, unsigned long mask);
+
void __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
unsigned int gchanges, u32 portid,
const struct nlmsghdr *nlh);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5c01dfaa6c44..e1d8af0ef6ab 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -12093,6 +12093,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
INIT_LIST_HEAD(&dev->ptype_all);
INIT_LIST_HEAD(&dev->ptype_specific);
INIT_LIST_HEAD(&dev->net_notifier_list);
+ INIT_LIST_HEAD(&dev->work_node);
#ifdef CONFIG_NET_SCHED
hash_init(dev->qdisc_hash);
#endif
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index e17f64a65e17..08528ca0a8b3 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -12,17 +12,10 @@
#include <linux/export.h>
#include <linux/list.h>
#include <linux/spinlock.h>
-#include <linux/workqueue.h>
#include <kunit/visibility.h>
#include "dev.h"
-static void netdev_rx_mode_work(struct work_struct *work);
-
-static LIST_HEAD(rx_mode_list);
-static DEFINE_SPINLOCK(rx_mode_lock);
-static DECLARE_WORK(rx_mode_work, netdev_rx_mode_work);
-
/*
* General list handling functions
*/
@@ -1281,7 +1274,7 @@ void netif_rx_mode_cancel_retry(struct net_device *dev)
dev->rx_mode_retry_count = 0;
}
-static void netif_rx_mode_run(struct net_device *dev)
+void netif_rx_mode_run(struct net_device *dev)
{
struct netdev_hw_addr_list uc_snap, mc_snap, uc_ref, mc_ref;
const struct net_device_ops *ops = dev->netdev_ops;
@@ -1339,49 +1332,9 @@ static void netif_rx_mode_run(struct net_device *dev)
}
}
-static void netdev_rx_mode_work(struct work_struct *work)
-{
- struct net_device *dev;
-
- rtnl_lock();
-
- while (true) {
- spin_lock_bh(&rx_mode_lock);
- if (list_empty(&rx_mode_list)) {
- spin_unlock_bh(&rx_mode_lock);
- break;
- }
- dev = list_first_entry(&rx_mode_list, struct net_device,
- rx_mode_node);
- list_del_init(&dev->rx_mode_node);
- /* We must free netdev tracker under
- * the spinlock protection.
- */
- netdev_tracker_free(dev, &dev->rx_mode_tracker);
- spin_unlock_bh(&rx_mode_lock);
-
- netdev_lock_ops(dev);
- netif_rx_mode_run(dev);
- netdev_unlock_ops(dev);
- /* Use __dev_put() because netdev_tracker_free() was already
- * called above. Must be after netdev_unlock_ops() to prevent
- * netdev_run_todo() from freeing the device while still in use.
- */
- __dev_put(dev);
- }
-
- rtnl_unlock();
-}
-
static void netif_rx_mode_queue(struct net_device *dev)
{
- spin_lock_bh(&rx_mode_lock);
- if (list_empty(&dev->rx_mode_node)) {
- list_add_tail(&dev->rx_mode_node, &rx_mode_list);
- netdev_hold(dev, &dev->rx_mode_tracker, GFP_ATOMIC);
- }
- spin_unlock_bh(&rx_mode_lock);
- schedule_work(&rx_mode_work);
+ __netdev_work_core_sched(dev, NETDEV_WORK_RX_MODE);
}
static void netif_rx_mode_retry(struct timer_list *t)
@@ -1394,7 +1347,6 @@ static void netif_rx_mode_retry(struct timer_list *t)
void netif_rx_mode_init(struct net_device *dev)
{
- INIT_LIST_HEAD(&dev->rx_mode_node);
__hw_addr_init(&dev->rx_mode_addr_cache);
timer_setup(&dev->rx_mode_retry_timer, netif_rx_mode_retry, 0);
}
@@ -1442,24 +1394,6 @@ void dev_set_rx_mode(struct net_device *dev)
netif_addr_unlock_bh(dev);
}
-bool netif_rx_mode_clean(struct net_device *dev)
-{
- bool clean = false;
-
- spin_lock_bh(&rx_mode_lock);
- if (!list_empty(&dev->rx_mode_node)) {
- list_del_init(&dev->rx_mode_node);
- clean = true;
- /* We must release netdev tracker under
- * the spinlock protection.
- */
- netdev_tracker_free(dev, &dev->rx_mode_tracker);
- }
- spin_unlock_bh(&rx_mode_lock);
-
- return clean;
-}
-
/**
* netif_rx_mode_sync() - sync rx mode inline
* @dev: network device
@@ -1473,11 +1407,6 @@ bool netif_rx_mode_clean(struct net_device *dev)
*/
void netif_rx_mode_sync(struct net_device *dev)
{
- if (netif_rx_mode_clean(dev)) {
+ if (__netdev_work_core_cancel(dev, NETDEV_WORK_RX_MODE))
netif_rx_mode_run(dev);
- /* Use __dev_put() because netdev_tracker_free() was already
- * called inside netif_rx_mode_clean().
- */
- __dev_put(dev);
- }
}
diff --git a/net/core/netdev_work.c b/net/core/netdev_work.c
new file mode 100644
index 000000000000..c121c24dc493
--- /dev/null
+++ b/net/core/netdev_work.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+#include <net/netdev_lock.h>
+
+#include "dev.h"
+
+static void netdev_work_proc(struct work_struct *work);
+
+/* @netdev_work_lock protects:
+ * - @netdev_work_list
+ * - within the list entries (struct net_device fields):
+ * - work_node
+ * - work_tracker
+ * - work_core_pending
+ */
+static LIST_HEAD(netdev_work_list);
+static DEFINE_SPINLOCK(netdev_work_lock);
+static DECLARE_WORK(netdev_work, netdev_work_proc);
+
+void __netdev_work_core_sched(struct net_device *dev, unsigned long event)
+{
+ spin_lock_bh(&netdev_work_lock);
+ if (list_empty(&dev->work_node)) {
+ list_add_tail(&dev->work_node, &netdev_work_list);
+ netdev_hold(dev, &dev->work_tracker, GFP_ATOMIC);
+ }
+ dev->work_core_pending |= event;
+ spin_unlock_bh(&netdev_work_lock);
+
+ schedule_work(&netdev_work);
+}
+
+/**
+ * __netdev_work_core_cancel() - cancel selected core work for a netdev
+ * @dev: net_device
+ * @mask: events to cancel
+ *
+ * Clear @mask from the device's work pending mask. If no work is left pending
+ * the device is dequeued.
+ *
+ * No expectations on locking, but also no guarantees provided. If the caller
+ * wants to touch @dev afterwards (e.g. call the work that got canceled)
+ * they have to ensure @dev does not get freed.
+ *
+ * Returns: the subset of @mask that was actually pending, so the caller can run
+ * those events inline.
+ */
+unsigned long
+__netdev_work_core_cancel(struct net_device *dev, unsigned long mask)
+{
+ unsigned long event;
+
+ spin_lock_bh(&netdev_work_lock);
+ event = dev->work_core_pending & mask;
+ dev->work_core_pending &= ~mask;
+ if (!list_empty(&dev->work_node) && !dev->work_core_pending) {
+ list_del_init(&dev->work_node);
+ netdev_put(dev, &dev->work_tracker);
+ }
+ spin_unlock_bh(&netdev_work_lock);
+
+ return event;
+}
+
+static void netdev_work_proc(struct work_struct *work)
+{
+ rtnl_lock();
+
+ while (true) {
+ netdevice_tracker tracker;
+ struct net_device *dev;
+ unsigned long core = 0;
+
+ spin_lock_bh(&netdev_work_lock);
+ if (list_empty(&netdev_work_list)) {
+ spin_unlock_bh(&netdev_work_lock);
+ break;
+ }
+ dev = list_first_entry(&netdev_work_list, struct net_device,
+ work_node);
+ /* Take a temporary reference so @dev can't be freed while we
+ * drop the lock to grab its ops lock; the work reference is
+ * only released once we claim the work below.
+ * The re-locking dance is to ensure that ops lock is enough
+ * to ensure canceling work is not racy with dequeue.
+ */
+ netdev_hold(dev, &tracker, GFP_ATOMIC);
+ spin_unlock_bh(&netdev_work_lock);
+
+ netdev_lock_ops(dev);
+ spin_lock_bh(&netdev_work_lock);
+ if (!list_empty(&dev->work_node)) {
+ list_del_init(&dev->work_node);
+ core = dev->work_core_pending;
+ dev->work_core_pending = 0;
+ /* We took another ref above */
+ netdev_put(dev, &dev->work_tracker);
+
+ if (!dev_isalive(dev))
+ core = 0;
+ }
+ spin_unlock_bh(&netdev_work_lock);
+
+ if (core & NETDEV_WORK_RX_MODE)
+ netif_rx_mode_run(dev);
+ netdev_unlock_ops(dev);
+
+ netdev_put(dev, &tracker);
+ }
+
+ rtnl_unlock();
+}
--
2.54.0
* [PATCH net 2/4] net: add the driver-facing netdev_work scheduling API
From: Jakub Kicinski @ 2026-06-24 18:20 UTC
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
aleksandr.loktionov, dtatulea, Jakub Kicinski
In-Reply-To: <20260624182018.2445732-1-kuba@kernel.org>
With an extra event mask we can easily extend the netdev work
to also service driver-defined events. For advanced drivers
this is probably not a perfect match, but it makes running
deferred work easier in simple cases.
Expose the netdev_work facility to drivers. Add helpers
to schedule work and a dedicated ndo to perform the
driver-scheduled actions.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
include/linux/netdevice.h | 11 ++++++
net/core/netdev_work.c | 81 ++++++++++++++++++++++++++++++---------
2 files changed, 74 insertions(+), 18 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 732506787db3..9981d637f8b5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1131,6 +1131,9 @@ struct netdev_net_notifier {
* netdev_hw_addr_list_for_each(ha, uc). Return 0 on success or a
* negative errno to request a retry via the core backoff.
*
+ * void (*ndo_work)(struct net_device *dev, unsigned long events);
+ * Run deferred work scheduled with netdev_work_sched(@events).
+ *
* int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
* This function is called when the Media Access Control address
* needs to be changed. If this interface is not defined, the
@@ -1460,6 +1463,8 @@ struct net_device_ops {
struct net_device *dev,
struct netdev_hw_addr_list *uc,
struct netdev_hw_addr_list *mc);
+ void (*ndo_work)(struct net_device *dev,
+ unsigned long events);
int (*ndo_set_mac_address)(struct net_device *dev,
void *addr);
int (*ndo_validate_addr)(struct net_device *dev);
@@ -1932,6 +1937,8 @@ enum netdev_reg_state {
* does not implement ndo_set_rx_mode()
* @work_node: List entry for async netdev_work processing
* @work_tracker: Refcount tracker for async netdev_work
+ * @work_pending: Driver-defined pending netdev_work, passed to
+ * ndo_work() (see netdev_work_sched())
* @work_core_pending: Core-defined pending netdev_work (NETDEV_WORK_*)
* @rx_mode_addr_cache: Recycled snapshot entries for rx_mode work
* @rx_mode_retry_timer: Timer that re-queues rx_mode work after failure
@@ -2329,6 +2336,7 @@ struct net_device {
bool uc_promisc;
struct list_head work_node;
netdevice_tracker work_tracker;
+ unsigned long work_pending;
unsigned long work_core_pending;
struct netdev_hw_addr_list rx_mode_addr_cache;
struct timer_list rx_mode_retry_timer;
@@ -5178,6 +5186,9 @@ void dev_fetch_sw_netstats(struct rtnl_link_stats64 *s,
const struct pcpu_sw_netstats __percpu *netstats);
void dev_get_tstats64(struct net_device *dev, struct rtnl_link_stats64 *s);
+void netdev_work_sched(struct net_device *dev, unsigned long events);
+unsigned long netdev_work_cancel(struct net_device *dev, unsigned long mask);
+
enum {
NESTED_SYNC_IMM_BIT,
NESTED_SYNC_TODO_BIT,
diff --git a/net/core/netdev_work.c b/net/core/netdev_work.c
index c121c24dc493..3109fae132ad 100644
--- a/net/core/netdev_work.c
+++ b/net/core/netdev_work.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/export.h>
#include <linux/list.h>
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>
@@ -16,32 +17,63 @@ static void netdev_work_proc(struct work_struct *work);
* - within the list entries (struct net_device fields):
* - work_node
* - work_tracker
+ * - work_pending
* - work_core_pending
*/
static LIST_HEAD(netdev_work_list);
static DEFINE_SPINLOCK(netdev_work_lock);
static DECLARE_WORK(netdev_work, netdev_work_proc);
-void __netdev_work_core_sched(struct net_device *dev, unsigned long event)
+static void netdev_work_enqueue(struct net_device *dev, unsigned long events,
+ unsigned long core)
{
+ if (!events && !core)
+ return;
+
spin_lock_bh(&netdev_work_lock);
if (list_empty(&dev->work_node)) {
list_add_tail(&dev->work_node, &netdev_work_list);
netdev_hold(dev, &dev->work_tracker, GFP_ATOMIC);
}
- dev->work_core_pending |= event;
+ dev->work_pending |= events;
+ dev->work_core_pending |= core;
spin_unlock_bh(&netdev_work_lock);
schedule_work(&netdev_work);
}
+static unsigned long
+netdev_work_dequeue(struct net_device *dev, unsigned long *pending,
+ unsigned long mask)
+{
+ unsigned long events;
+
+ spin_lock_bh(&netdev_work_lock);
+ events = *pending & mask;
+ *pending &= ~events;
+ if (!list_empty(&dev->work_node) &&
+ !dev->work_pending && !dev->work_core_pending) {
+ list_del_init(&dev->work_node);
+ netdev_put(dev, &dev->work_tracker);
+ }
+ spin_unlock_bh(&netdev_work_lock);
+
+ return events;
+}
+
+void netdev_work_sched(struct net_device *dev, unsigned long events)
+{
+ netdev_work_enqueue(dev, events, 0);
+}
+EXPORT_SYMBOL(netdev_work_sched);
+
/**
- * __netdev_work_core_cancel() - cancel selected core work for a netdev
+ * netdev_work_cancel() - cancel selected work for a netdev
* @dev: net_device
* @mask: events to cancel
*
* Clear @mask from the device's work pending mask. If no work is left pending
- * the device is dequeued.
+ * the device is dequeued and its ndo_work won't be called.
*
* No expectations on locking, but also no guarantees provided. If the caller
* wants to touch @dev afterwards (e.g. call the work that got canceled)
@@ -50,21 +82,33 @@ void __netdev_work_core_sched(struct net_device *dev, unsigned long event)
* Returns: the subset of @mask that was actually pending, so the caller can run
* those events inline.
*/
+unsigned long netdev_work_cancel(struct net_device *dev, unsigned long mask)
+{
+ return netdev_work_dequeue(dev, &dev->work_pending, mask);
+}
+EXPORT_SYMBOL(netdev_work_cancel);
+
+void __netdev_work_core_sched(struct net_device *dev, unsigned long events)
+{
+ netdev_work_enqueue(dev, 0, events);
+}
+
unsigned long
__netdev_work_core_cancel(struct net_device *dev, unsigned long mask)
{
- unsigned long event;
+ return netdev_work_dequeue(dev, &dev->work_core_pending, mask);
+}
- spin_lock_bh(&netdev_work_lock);
- event = dev->work_core_pending & mask;
- dev->work_core_pending &= ~mask;
- if (!list_empty(&dev->work_node) && !dev->work_core_pending) {
- list_del_init(&dev->work_node);
- netdev_put(dev, &dev->work_tracker);
- }
- spin_unlock_bh(&netdev_work_lock);
+static void netdev_work_run(struct net_device *dev, unsigned long events,
+ unsigned long core)
+{
+ if (!netif_device_present(dev))
+ return;
- return event;
+ if (core & NETDEV_WORK_RX_MODE)
+ netif_rx_mode_run(dev);
+ if (events && dev->netdev_ops->ndo_work)
+ dev->netdev_ops->ndo_work(dev, events);
}
static void netdev_work_proc(struct work_struct *work)
@@ -72,9 +116,9 @@ static void netdev_work_proc(struct work_struct *work)
rtnl_lock();
while (true) {
+ unsigned long events = 0, core = 0;
netdevice_tracker tracker;
struct net_device *dev;
- unsigned long core = 0;
spin_lock_bh(&netdev_work_lock);
if (list_empty(&netdev_work_list)) {
@@ -98,16 +142,17 @@ static void netdev_work_proc(struct work_struct *work)
list_del_init(&dev->work_node);
core = dev->work_core_pending;
dev->work_core_pending = 0;
+ events = dev->work_pending;
+ dev->work_pending = 0;
/* We took another ref above */
netdev_put(dev, &dev->work_tracker);
if (!dev_isalive(dev))
- core = 0;
+ core = events = 0;
}
spin_unlock_bh(&netdev_work_lock);
- if (core & NETDEV_WORK_RX_MODE)
- netif_rx_mode_run(dev);
+ netdev_work_run(dev, events, core);
netdev_unlock_ops(dev);
netdev_put(dev, &tracker);
--
2.54.0
* [PATCH net 3/4] vlan: defer real device state propagation to netdev_work
From: Jakub Kicinski @ 2026-06-24 18:20 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
aleksandr.loktionov, dtatulea, Jakub Kicinski,
syzbot+09da62a8b78959ceb8bb, syzbot+cb67c392b0b8f0fd0fc1,
syzbot+9bb8bd77f3966641f298
In-Reply-To: <20260624182018.2445732-1-kuba@kernel.org>
vlan_device_event() generates nested UP/DOWN, MTU and feature
change events. It executes an event for the VLAN device directly
from the notifier - while the locks of the lower device are held.
This causes deadlocks, for example:
bond (3) bond_update_speed_duplex(vlan)
| ^ v
vlan (2) UP(vlan) (4) vlan_ethtool_get_link_ksettings()
| ^ v
dummy (1) UP(dummy) (5) __ethtool_get_link_ksettings()
The dummy device is ops locked, vlan creates a nested event (2),
then bond wants to ask vlan for link state (3). bond uses the
"I'm already holding the instance lock" flavor of API. But in
this case the lock held refers to vlan itself. We hit vlan's
link settings trampoline (4) and call __ethtool_get_link_ksettings()
which tries to lock dummy. Deadlock. There's no clean way for us
to tell vlan_ethtool_get_link_ksettings() that the caller
is already in the lower device's critical section.
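As a rough illustration only (a userspace model, not kernel code), the self-deadlock above can be reduced to a non-recursive lock that is taken a second time on the same call path. All names below (model_lock, model_query_lower, model_notifier_up) are hypothetical, and the model returns -EDEADLK where the real kernel would simply hang. The `defer` flag stands in for running the query only after the lock has been dropped:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lower device's (non-recursive) ops lock. */
struct model_lock {
	bool held;
};

static int model_lock_acquire(struct model_lock *lock)
{
	if (lock->held)
		return -EDEADLK;	/* the real kernel would block forever */
	lock->held = true;
	return 0;
}

static void model_lock_release(struct model_lock *lock)
{
	assert(lock->held);
	lock->held = false;
}

/* Models step (5): the upper device asks the lower one for link settings. */
static int model_query_lower(struct model_lock *lower)
{
	int err = model_lock_acquire(lower);

	if (err)
		return err;
	model_lock_release(lower);
	return 0;
}

/*
 * Models step (1): the notifier runs with the lower device's lock held.
 * With defer == false the query is nested inside the critical section;
 * with defer == true it runs only after the lock has been dropped.
 */
static int model_notifier_up(struct model_lock *lower, bool defer)
{
	int err = model_lock_acquire(lower);

	if (err)
		return err;
	if (!defer)
		err = model_query_lower(lower);
	model_lock_release(lower);
	if (defer)
		err = model_query_lower(lower);
	return err;
}
```

In this model, the nested call fails while the deferred one succeeds, which is the whole point of moving the work out of the notifier.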
Defer the propagation to the per-netdev work facility instead:
the notifier only schedules netdev_work_sched(vlandev, VLAN_WORK_*),
and ndo_work (vlan_dev_work) applies the change later. Hopefully
nobody expects the VLAN state changes to be instantaneous.
If someone does expect the changes to be instantaneous we will
have to do the same thing Stan did for rx_mode and "strategically"
place sync calls, to make sure such delayed works are executed
after we drop the ops lock but before we drop rtnl_lock.
Stan suggests that if we need that down the line we may
consider reshaping the mechanism into "async notifications".
AFAICT only vlan does this sort of netdev open chaining,
so as a first try I think that sticking the complexity into
the vlan code makes sense.
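A minimal sketch of the link-state decision the deferred handler makes, written as a pure userspace function. It mirrors the VLAN_WORK_LINK_STATE branch of vlan_dev_work() in the diff below but leaves out the operstate transfer; the enum and function names are made up for the example:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type: what the handler would do to the VLAN. */
enum model_action {
	MODEL_NONE,	/* leave the VLAN's admin state alone */
	MODEL_OPEN,	/* set IFF_UP on the VLAN */
	MODEL_CLOSE,	/* clear IFF_UP on the VLAN */
};

/*
 * Decide from the *current* state of both devices, not from the
 * notification that scheduled the work.
 */
static enum model_action
model_link_sync(bool real_up, bool vlan_up, bool loose_binding)
{
	/* Loosely bound VLANs never follow the real device's admin state. */
	if (loose_binding)
		return MODEL_NONE;
	if (real_up && !vlan_up)
		return MODEL_OPEN;
	if (!real_up && vlan_up)
		return MODEL_CLOSE;
	return MODEL_NONE;
}
```

Because the decision depends only on the state at the time the work runs, the handler does not need to remember whether the original notification was NETDEV_UP or NETDEV_DOWN.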
One corner case is that we need to cancel the event if the user
explicitly changes the state before the work has run. Consider
the following operations with vlan0 on top of dummy0:
  ip link set dev dummy0 up    # queues work to up vlan0
  ip link set dev vlan0 down   # user explicitly downs the vlan
  ndo_work                     # acts on the stale event
The patch handles this by calling
netdev_work_cancel(dev, VLAN_WORK_LINK_STATE) from vlan_dev_open()
and vlan_dev_stop().
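To make the cancellation semantics concrete, here is a small userspace model of the pending mask. It is a sketch under assumptions, not the kernel implementation: struct model_dev, the MODEL_* bits and the three helpers are invented for illustration, and locking and reference counting are left out:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event bits, standing in for VLAN_WORK_*. */
enum {
	MODEL_LINK_STATE = 1 << 0,
	MODEL_MTU = 1 << 1,
};

/* Hypothetical model of the per-device work state. */
struct model_dev {
	unsigned long work_pending;		/* driver (ndo_work) events */
	unsigned long work_core_pending;	/* core events, e.g. rx_mode */
	bool queued;				/* device sits on the work list */
};

/* Models scheduling: record the events and queue the device. */
static void model_sched(struct model_dev *dev, unsigned long events)
{
	dev->work_pending |= events;
	if (events)
		dev->queued = true;
}

/*
 * Models cancel: clear @mask, report which bits were really pending,
 * and take the device off the list once nothing at all is left.
 */
static unsigned long model_cancel(struct model_dev *dev, unsigned long mask)
{
	unsigned long events = dev->work_pending & mask;

	dev->work_pending &= ~events;
	if (dev->queued && !dev->work_pending && !dev->work_core_pending)
		dev->queued = false;
	return events;
}

/* Models the worker: returns the events that would reach ndo_work. */
static unsigned long model_run(struct model_dev *dev)
{
	unsigned long events = dev->work_pending;

	dev->work_pending = 0;
	dev->work_core_pending = 0;
	dev->queued = false;
	return events;
}
```

In the scenario above, cancelling MODEL_LINK_STATE before the worker runs means model_run() no longer reports that bit, while any other pending bit (for example MODEL_MTU) is still delivered.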
Reported-by: syzbot+09da62a8b78959ceb8bb@syzkaller.appspotmail.com
Reported-by: syzbot+cb67c392b0b8f0fd0fc1@syzkaller.appspotmail.com
Reported-by: syzbot+9bb8bd77f3966641f298@syzkaller.appspotmail.com
Fixes: 9f275c2e9020 ("net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Documentation/networking/netdevices.rst | 2 +
net/8021q/vlan.h | 11 ++++
net/8021q/vlan.c | 76 +++----------------------
net/8021q/vlan_dev.c | 60 +++++++++++++++++++
net/core/dev.c | 1 +
5 files changed, 82 insertions(+), 68 deletions(-)
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index fde601acd1d2..d2a238f8cc8b 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -433,6 +433,8 @@ exceptions) notifiers run under the instance lock. Please extend this
documentation whenever you make explicit assumption about lock being held
from a notifier.
+Drivers **must not** generate nested notifications of the ops-locked types.
+
NETDEV_INTERNAL symbol namespace
================================
diff --git a/net/8021q/vlan.h b/net/8021q/vlan.h
index c7ffe591d593..c41caaf94095 100644
--- a/net/8021q/vlan.h
+++ b/net/8021q/vlan.h
@@ -125,6 +125,17 @@ static inline netdev_features_t vlan_tnl_features(struct net_device *real_dev)
int vlan_filter_push_vids(struct vlan_info *vlan_info, __be16 proto);
void vlan_filter_drop_vids(struct vlan_info *vlan_info, __be16 proto);
+/* netdev_work events propagated from the real device, see vlan_dev_work(). */
+enum {
+ VLAN_WORK_LINK_STATE = BIT(0), /* sync up/down with real_dev */
+ VLAN_WORK_MTU = BIT(1), /* clamp mtu to real_dev's */
+ VLAN_WORK_FEATURES = BIT(2), /* re-inherit real_dev features */
+};
+
+void vlan_stacked_transfer_operstate(const struct net_device *rootdev,
+ struct net_device *dev,
+ struct vlan_dev_priv *vlan);
+
/* found in vlan_dev.c */
void vlan_dev_set_ingress_priority(const struct net_device *dev,
u32 skb_prio, u16 vlan_prio);
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 2b74ed56eb16..2d2efb877975 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -77,9 +77,9 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
return 0;
}
-static void vlan_stacked_transfer_operstate(const struct net_device *rootdev,
- struct net_device *dev,
- struct vlan_dev_priv *vlan)
+void vlan_stacked_transfer_operstate(const struct net_device *rootdev,
+ struct net_device *dev,
+ struct vlan_dev_priv *vlan)
{
if (!(vlan->flags & VLAN_FLAG_BRIDGE_BINDING))
netif_stacked_transfer_operstate(rootdev, dev);
@@ -316,29 +316,6 @@ static void vlan_sync_address(struct net_device *dev,
ether_addr_copy(vlan->real_dev_addr, dev->dev_addr);
}
-static void vlan_transfer_features(struct net_device *dev,
- struct net_device *vlandev)
-{
- struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev);
-
- netif_inherit_tso_max(vlandev, dev);
-
- if (vlan_hw_offload_capable(dev->features, vlan->vlan_proto))
- vlandev->hard_header_len = dev->hard_header_len;
- else
- vlandev->hard_header_len = dev->hard_header_len + VLAN_HLEN;
-
-#if IS_ENABLED(CONFIG_FCOE)
- vlandev->fcoe_ddp_xid = dev->fcoe_ddp_xid;
-#endif
-
- vlandev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
- vlandev->priv_flags |= (vlan->real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
- vlandev->hw_enc_features = vlan_tnl_features(vlan->real_dev);
-
- netdev_update_features(vlandev);
-}
-
static int __vlan_device_event(struct net_device *dev, unsigned long event)
{
int err = 0;
@@ -391,13 +368,11 @@ static void vlan_vid0_del(struct net_device *dev)
static int vlan_device_event(struct notifier_block *unused, unsigned long event,
void *ptr)
{
- struct netlink_ext_ack *extack = netdev_notifier_info_to_extack(ptr);
struct net_device *dev = netdev_notifier_info_to_dev(ptr);
struct vlan_group *grp;
struct vlan_info *vlan_info;
int i, flgs;
struct net_device *vlandev;
- struct vlan_dev_priv *vlan;
bool last = false;
LIST_HEAD(list);
int err;
@@ -447,54 +422,19 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event,
if (vlandev->mtu <= dev->mtu)
continue;
- dev_set_mtu(vlandev, dev->mtu);
+ netdev_work_sched(vlandev, VLAN_WORK_MTU);
}
break;
case NETDEV_FEAT_CHANGE:
- /* Propagate device features to underlying device */
vlan_group_for_each_dev(grp, i, vlandev)
- vlan_transfer_features(dev, vlandev);
+ netdev_work_sched(vlandev, VLAN_WORK_FEATURES);
break;
- case NETDEV_DOWN: {
- struct net_device *tmp;
- LIST_HEAD(close_list);
-
- /* Put all VLANs for this dev in the down state too. */
- vlan_group_for_each_dev(grp, i, vlandev) {
- flgs = vlandev->flags;
- if (!(flgs & IFF_UP))
- continue;
-
- vlan = vlan_dev_priv(vlandev);
- if (!(vlan->flags & VLAN_FLAG_LOOSE_BINDING))
- list_add(&vlandev->close_list, &close_list);
- }
-
- netif_close_many(&close_list, false);
-
- list_for_each_entry_safe(vlandev, tmp, &close_list, close_list) {
- vlan_stacked_transfer_operstate(dev, vlandev,
- vlan_dev_priv(vlandev));
- list_del_init(&vlandev->close_list);
- }
- list_del(&close_list);
- break;
- }
+ case NETDEV_DOWN:
case NETDEV_UP:
- /* Put all VLANs for this dev in the up state too. */
- vlan_group_for_each_dev(grp, i, vlandev) {
- flgs = netif_get_flags(vlandev);
- if (flgs & IFF_UP)
- continue;
-
- vlan = vlan_dev_priv(vlandev);
- if (!(vlan->flags & VLAN_FLAG_LOOSE_BINDING))
- dev_change_flags(vlandev, flgs | IFF_UP,
- extack);
- vlan_stacked_transfer_operstate(dev, vlandev, vlan);
- }
+ vlan_group_for_each_dev(grp, i, vlandev)
+ netdev_work_sched(vlandev, VLAN_WORK_LINK_STATE);
break;
case NETDEV_UNREGISTER:
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 7aa3af8b10ea..ec2569b3f8da 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -270,6 +270,9 @@ static int vlan_dev_open(struct net_device *dev)
!(vlan->flags & VLAN_FLAG_LOOSE_BINDING))
return -ENETDOWN;
+ /* The explicit open supersedes any deferred link-state sync */
+ netdev_work_cancel(dev, VLAN_WORK_LINK_STATE);
+
if (!ether_addr_equal(dev->dev_addr, real_dev->dev_addr) &&
!vlan_dev_inherit_address(dev, real_dev)) {
err = dev_uc_add(real_dev, dev->dev_addr);
@@ -300,6 +303,9 @@ static int vlan_dev_stop(struct net_device *dev)
struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
struct net_device *real_dev = vlan->real_dev;
+ /* The explicit close supersedes any deferred link-state sync */
+ netdev_work_cancel(dev, VLAN_WORK_LINK_STATE);
+
dev_mc_unsync(real_dev, dev);
dev_uc_unsync(real_dev, dev);
@@ -1016,6 +1022,59 @@ static const struct ethtool_ops vlan_ethtool_ops = {
.get_ts_info = vlan_ethtool_get_ts_info,
};
+static void vlan_transfer_features(struct net_device *dev,
+ struct net_device *vlandev)
+{
+ struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev);
+
+ netif_inherit_tso_max(vlandev, dev);
+
+ if (vlan_hw_offload_capable(dev->features, vlan->vlan_proto))
+ vlandev->hard_header_len = dev->hard_header_len;
+ else
+ vlandev->hard_header_len = dev->hard_header_len + VLAN_HLEN;
+
+#if IS_ENABLED(CONFIG_FCOE)
+ vlandev->fcoe_ddp_xid = dev->fcoe_ddp_xid;
+#endif
+
+ vlandev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+ vlandev->priv_flags |= (vlan->real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
+ vlandev->hw_enc_features = vlan_tnl_features(vlan->real_dev);
+
+ netdev_update_features(vlandev);
+}
+
+static void vlan_dev_work(struct net_device *vlandev, unsigned long events)
+{
+ struct vlan_dev_priv *vlan = vlan_dev_priv(vlandev);
+ struct net_device *real_dev = vlan->real_dev;
+ bool loose = vlan->flags & VLAN_FLAG_LOOSE_BINDING;
+ unsigned int flgs;
+
+ if (events & VLAN_WORK_LINK_STATE) {
+ flgs = netif_get_flags(vlandev);
+ if (real_dev->flags & IFF_UP) {
+ if (!(flgs & IFF_UP)) {
+ if (!loose)
+ netif_change_flags(vlandev,
+ flgs | IFF_UP, NULL);
+ vlan_stacked_transfer_operstate(real_dev,
+ vlandev, vlan);
+ }
+ } else if ((flgs & IFF_UP) && !loose) {
+ netif_change_flags(vlandev, flgs & ~IFF_UP, NULL);
+ vlan_stacked_transfer_operstate(real_dev, vlandev, vlan);
+ }
+ }
+
+ if ((events & VLAN_WORK_MTU) && vlandev->mtu > real_dev->mtu)
+ netif_set_mtu(vlandev, real_dev->mtu);
+
+ if (events & VLAN_WORK_FEATURES)
+ vlan_transfer_features(real_dev, vlandev);
+}
+
static const struct net_device_ops vlan_netdev_ops = {
.ndo_change_mtu = vlan_dev_change_mtu,
.ndo_init = vlan_dev_init,
@@ -1027,6 +1086,7 @@ static const struct net_device_ops vlan_netdev_ops = {
.ndo_set_mac_address = vlan_dev_set_mac_address,
.ndo_set_rx_mode = vlan_dev_set_rx_mode,
.ndo_change_rx_flags = vlan_dev_change_rx_flags,
+ .ndo_work = vlan_dev_work,
.ndo_eth_ioctl = vlan_dev_ioctl,
.ndo_neigh_setup = vlan_dev_neigh_setup,
.ndo_get_stats64 = vlan_dev_get_stats64,
diff --git a/net/core/dev.c b/net/core/dev.c
index e1d8af0ef6ab..4b3d5cfdf6e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9822,6 +9822,7 @@ int netif_change_flags(struct net_device *dev, unsigned int flags,
__dev_notify_flags(dev, old_flags, changes, 0, NULL);
return ret;
}
+EXPORT_SYMBOL(netif_change_flags);
int __netif_set_mtu(struct net_device *dev, int new_mtu)
{
--
2.54.0
* [PATCH net 4/4] selftests: bonding: add a test for VLAN propagation over a bonded real device
From: Jakub Kicinski @ 2026-06-24 18:20 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
aleksandr.loktionov, dtatulea, Jakub Kicinski
In-Reply-To: <20260624182018.2445732-1-kuba@kernel.org>
Add a regression test for the VLAN notifier handling that the netdev_work
deferral fixed.
A VLAN's real device propagates its UP/DOWN, MTU and feature changes onto
the VLANs stacked on top of it. This used to be done synchronously from the
real device's notifier and deadlocked when the real device was brought up
while enslaved to a bond (instance lock held across NETDEV_UP) and the VLAN
on top was itself a bond member: the synchronous propagation re-entered the
stack and took the same instance lock again.
The test covers both halves:
- that the deferred UP/DOWN, MTU and feature propagation actually lands on
the VLAN (link state and MTU use an ops-locked dummy, i.e. the deferral
path; features use veth, which exports vlan_features to inherit), and
- that the deadlock-prone topology - a VLAN on a dummy, with the VLAN and
the dummy each enslaved to a different bond - can be built without
hanging.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
.../selftests/drivers/net/bonding/Makefile | 1 +
.../drivers/net/bonding/bond_vlan_real_dev.sh | 180 ++++++++++++++++++
2 files changed, 181 insertions(+)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
diff --git a/tools/testing/selftests/drivers/net/bonding/Makefile b/tools/testing/selftests/drivers/net/bonding/Makefile
index be130bf585a4..6364ca02642d 100644
--- a/tools/testing/selftests/drivers/net/bonding/Makefile
+++ b/tools/testing/selftests/drivers/net/bonding/Makefile
@@ -13,6 +13,7 @@ TEST_PROGS := \
bond_options.sh \
bond_passive_lacp.sh \
bond_stacked_header_parse.sh \
+ bond_vlan_real_dev.sh \
dev_addr_lists.sh \
mode-1-recovery-updelay.sh \
mode-2-recovery-updelay.sh \
diff --git a/tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh b/tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
new file mode 100755
index 000000000000..542d9ffc4819
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/bonding/bond_vlan_real_dev.sh
@@ -0,0 +1,180 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test propagation of a real device's state to the VLANs stacked on top of it
+# when the real device is (or becomes) a bond member.
+#
+# The kernel mirrors a real device's UP/DOWN, MTU and feature changes onto its
+# VLANs. This is done asynchronously (netdev_work): doing it synchronously from
+# the real device's notifier could deadlock. If the real device is brought up
+# while enslaved to a bond - so its instance lock is held across NETDEV_UP - and
+# a VLAN on top of it is itself a bond member, the synchronous propagation
+# re-entered the stack and tried to take the same instance lock again.
+#
+# Cover both halves:
+# - the deferred UP/DOWN, MTU and feature propagation actually lands on the
+# VLAN (link state and MTU use an ops-locked dummy, i.e. the deferral path),
+# - the deadlock-prone topology - a VLAN on a dummy, with the VLAN and the
+# dummy each enslaved to a different bond - can be built without hanging.
+
+ALL_TESTS="
+ vlan_link_state
+ vlan_mtu
+ vlan_features
+ vlan_real_dev_enslave
+"
+
+REQUIRE_MZ=no
+NUM_NETIFS=0
+lib_dir=$(dirname "$0")
+source "$lib_dir"/../../../net/forwarding/lib.sh
+
+# Return 0 if $dev in netns $ns has flag $flag set (e.g. UP) in its <...> flags.
+link_has_flag()
+{
+ local ns=$1 dev=$2 flag=$3
+
+ ip -n "$ns" link show dev "$dev" 2>/dev/null | grep -q "[<,]${flag}[,>]"
+}
+
+link_lacks_flag()
+{
+ ! link_has_flag "$@"
+}
+
+link_mtu_is()
+{
+ local ns=$1 dev=$2 want=$3 cur
+
+ cur=$(ip -n "$ns" link show dev "$dev" 2>/dev/null | \
+ sed -n 's/.* mtu \([0-9]\+\).*/\1/p')
+ [ "$cur" = "$want" ]
+}
+
+vlan_feature_is()
+{
+ local ns=$1 dev=$2 feature=$3 value=$4
+
+ ip netns exec "$ns" ethtool -k "$dev" 2>/dev/null | \
+ grep -q "^$feature: $value"
+}
+
+link_has_master()
+{
+ local ns=$1 dev=$2 master=$3
+
+ ip -n "$ns" -o link show dev "$dev" 2>/dev/null | grep -q "master $master"
+}
+
+vlan_link_state()
+{
+ RET=0
+
+ ip -n "$NS" link add ls_dummy type dummy
+ ip -n "$NS" link add link ls_dummy name ls_vlan type vlan id 100
+
+ # Bringing the real device up must propagate UP to the VLAN.
+ ip -n "$NS" link set ls_dummy up
+ busywait "$BUSYWAIT_TIMEOUT" link_has_flag "$NS" ls_vlan UP
+ check_err $? "VLAN did not go UP after the real device went UP"
+
+ # ... and likewise for DOWN.
+ ip -n "$NS" link set ls_dummy down
+ busywait "$BUSYWAIT_TIMEOUT" link_lacks_flag "$NS" ls_vlan UP
+ check_err $? "VLAN did not go DOWN after the real device went DOWN"
+
+ ip -n "$NS" link del ls_vlan
+ ip -n "$NS" link del ls_dummy
+
+ log_test "VLAN link state follows the real device"
+}
+
+vlan_mtu()
+{
+ RET=0
+
+ # The VLAN inherits the real device's MTU (2000) at creation time.
+ ip -n "$NS" link add mtu_dummy mtu 2000 type dummy
+ ip -n "$NS" link add link mtu_dummy name mtu_vlan type vlan id 100
+
+ # Shrinking the real device's MTU must clamp the VLAN's MTU.
+ ip -n "$NS" link set mtu_dummy mtu 1500
+ busywait "$BUSYWAIT_TIMEOUT" link_mtu_is "$NS" mtu_vlan 1500
+ check_err $? "VLAN MTU not clamped after the real device's MTU shrank"
+
+ ip -n "$NS" link del mtu_vlan
+ ip -n "$NS" link del mtu_dummy
+
+ log_test "VLAN MTU clamped to the real device"
+}
+
+vlan_features()
+{
+ RET=0
+
+ # Use veth as the real device: unlike dummy it exports vlan_features, so
+ # the VLAN actually inherits a toggleable offload to assert on.
+ ip -n "$NS" link add ft_veth type veth peer name ft_veth_pr
+ ip -n "$NS" link add link ft_veth name ft_vlan type vlan id 100
+
+ vlan_feature_is "$NS" ft_vlan scatter-gather on
+ check_err $? "VLAN did not inherit scatter-gather from the real device"
+
+ # Toggling the offload on the real device must propagate to the VLAN.
+ ip netns exec "$NS" ethtool -K ft_veth sg off
+ busywait "$BUSYWAIT_TIMEOUT" \
+ vlan_feature_is "$NS" ft_vlan scatter-gather off
+ check_err $? "VLAN scatter-gather still on after disabling it on real dev"
+
+ ip netns exec "$NS" ethtool -K ft_veth sg on
+ busywait "$BUSYWAIT_TIMEOUT" \
+ vlan_feature_is "$NS" ft_vlan scatter-gather on
+ check_err $? "VLAN scatter-gather still off after enabling it on real dev"
+
+ ip -n "$NS" link del ft_vlan
+ ip -n "$NS" link del ft_veth
+
+ log_test "VLAN features follow the real device"
+}
+
+vlan_real_dev_enslave()
+{
+ RET=0
+
+ # dummy <- VLAN -> bond0, then enslave the dummy itself to bond1. The
+ # last step brings the dummy up under bond1's instance lock, which used
+ # to deadlock while synchronously propagating UP to the (bond-enslaved)
+ # VLAN on top.
+ ip -n "$NS" link add dl_dummy type dummy
+ ip -n "$NS" link set dl_dummy up
+ ip -n "$NS" link add link dl_dummy name dl_vlan type vlan id 100
+
+ ip -n "$NS" link add dl_bond0 type bond mode active-backup
+ ip -n "$NS" link set dl_vlan down
+ ip -n "$NS" link set dl_vlan master dl_bond0
+ check_err $? "could not enslave the VLAN to bond0"
+
+ ip -n "$NS" link add dl_bond1 type bond mode active-backup
+ ip -n "$NS" link set dl_dummy down
+ ip -n "$NS" link set dl_dummy master dl_bond1
+ check_err $? "could not enslave the real device to bond1"
+
+ # If we got here the kernel did not deadlock; make sure it is still
+ # responsive and the enslave really took effect.
+ link_has_master "$NS" dl_dummy dl_bond1
+ check_err $? "real device not enslaved to bond1"
+
+ ip -n "$NS" link del dl_bond1
+ ip -n "$NS" link del dl_bond0
+ ip -n "$NS" link del dl_vlan
+ ip -n "$NS" link del dl_dummy
+
+ log_test "VLAN real device enslaved to a second bond"
+}
+
+setup_ns NS
+trap 'cleanup_ns $NS' EXIT
+
+tests_run
+
+exit "$EXIT_STATUS"
--
2.54.0
* Re: [PATCH] af_unix: move proto info out of CONFIG_BPF_SYSCALL
From: Simon Horman @ 2026-06-24 18:20 UTC (permalink / raw)
To: Ben Dooks
Cc: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20260623124940.791230-1-ben.dooks@codethink.co.uk>
On Tue, Jun 23, 2026 at 01:49:40PM +0100, Ben Dooks wrote:
> These two structs are defined even if CONFIG_BPF_SYSCALL is not set, but
> the header does not export them, so declare them anyway and
> move the check for CONFIG_BPF_SYSCALL lower into the file.
>
> This removes the two sparse warnings:
> net/unix/af_unix.c:1060:14: warning: symbol 'unix_dgram_proto' was not declared. Should it be static?
> net/unix/af_unix.c:1071:14: warning: symbol 'unix_stream_proto' was not declared. Should it be static?
>
> This change is less complicated than trying to make those two
> structs static based on the CONFIG_BPF_SYSCALL configuration.
>
> Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Probably this is net-next material and if so
should be reposted once net-next reopens next week.
* Re: [PATCH v2] netdevsim: fix use-after-free in nsim_create and __nsim_dev_port_del
From: Simon Horman @ 2026-06-24 18:35 UTC (permalink / raw)
To: Hrushiraj Gandhi
Cc: Jakub Kicinski, Andrew Lunn, David S . Miller, Eric Dumazet,
Paolo Abeni, Jiri Pirko, netdev, linux-kernel, bpf,
syzbot+6c25f4750230faf70be9
In-Reply-To: <20260623144447.255326-1-hrushirajg23@gmail.com>
On Tue, Jun 23, 2026 at 08:14:47PM +0530, Hrushiraj Gandhi wrote:
> debugfs files created under a port's ddir (ethtool/get_err,
> ethtool/set_err, ring params, bpf_offloaded_id, udp_ports/inject_error,
> etc.) store raw pointers directly into the netdevsim struct, which lives
> in the net_device private data kmalloc slab.
>
> If these files outlive the netdevsim struct, a concurrent reader can
> trigger a slab-use-after-free by passing debugfs_file_get() (which only
> checks dentry lifetime) and then dereferencing the freed data pointer
> in debugfs_u32_get().
>
> In __nsim_dev_port_del(), nsim_destroy() is called before
> nsim_dev_port_debugfs_exit(). However, nsim_destroy() calls free_netdev()
> at its end, while nsim_dev_port_debugfs_exit() removes the port's
> debugfs directory. This means the slab is freed before the debugfs
> files are removed.
>
> The same window exists on nsim_create()'s error path:
> nsim_ethtool_init() creates debugfs files under ddir with pointers into
> ns before nsim_init_netdevsim()/nsim_init_netdevsim_vf() which can fail,
> and the err_free_netdev label calls free_netdev() while those debugfs
> entries are still live.
>
> Fix both paths by calling debugfs_remove_recursive() on the port's
> ddir before every free_netdev() call. The subsequent
> nsim_dev_port_debugfs_exit() calls become harmless no-ops since ddir is
> set to NULL.
>
> Reported-by: syzbot+6c25f4750230faf70be9@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=6c25f4750230faf70be9
> Fixes: e05b2d141fef ("netdevsim: move netdev creation/destruction to dev probe")
> Signed-off-by: Hrushiraj Gandhi <hrushirajg23@gmail.com>
> ---
> v2:
> - Also fix the same use-after-free window on the error path of nsim_create() as suggested by Simon Horman.
> - Shorten the code comment in nsim_destroy() to be more concise.
Thanks for the updates.
Reviewed-by: Simon Horman <horms@kernel.org>
* Re: [PATCH v2 net 1/2] tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
From: Xin Long @ 2026-06-24 18:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Jon Maloy, tipc-discussion, netdev,
eric.dumazet, syzbot+e14bc5d4942756023b77
In-Reply-To: <20260623173030.2925059-2-edumazet@google.com>
> diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
> index 988b8a7f953ad..66f3cb87a0aaa 100644
> --- a/net/tipc/udp_media.c
> +++ b/net/tipc/udp_media.c
> @@ -803,6 +803,14 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b,
> return err;
> }
>
> +static void rcast_free_rcu(struct rcu_head *rcu)
> +{
> + struct udp_replicast *rcast = container_of(rcu, struct udp_replicast, rcu);
> +
> + dst_cache_destroy(&rcast->dst_cache);
> + kfree(rcast);
> +}
> +
Since this adds a module-specific callback rcast_free_rcu registered with RCU
via call_rcu_hurry(), is an rcu_barrier() needed in the TIPC module exit
function?
If the module is unloaded, the RCU grace period might expire after the module
memory is freed.
net/tipc/core.c:tipc_exit() {
tipc_netlink_compat_stop();
...
pr_info("Deactivated\n");
}
Could this result in a kernel panic when RCU attempts to execute the unloaded
rcast_free_rcu function?
This sashiko report looks legit.
I don't think synchronize_net() guarantees that rcast_free_rcu() has finished.
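For readers less familiar with the distinction being discussed: waiting for a grace period and waiting for queued callbacks to finish are different guarantees. The following userspace toy (hypothetical names throughout, no real RCU involved) models a callback queue in which only an explicit barrier drains pending callbacks before the code that implements them goes away:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_CBS 8

typedef void (*model_cb_t)(void *arg);

/* Hypothetical queue of deferred callbacks. */
struct model_rcu {
	model_cb_t cb[MODEL_MAX_CBS];
	void *arg[MODEL_MAX_CBS];
	size_t pending;
};

/* Models call_rcu(): queue a callback to run at some later point. */
static int model_call_rcu(struct model_rcu *rcu, model_cb_t cb, void *arg)
{
	if (rcu->pending == MODEL_MAX_CBS)
		return -1;
	rcu->cb[rcu->pending] = cb;
	rcu->arg[rcu->pending] = arg;
	rcu->pending++;
	return 0;
}

/* Models a grace-period wait: readers are done, callbacks stay queued. */
static void model_synchronize(struct model_rcu *rcu)
{
	(void)rcu;
}

/* Models rcu_barrier(): return only once every queued callback has run. */
static void model_barrier(struct model_rcu *rcu)
{
	size_t i;

	for (i = 0; i < rcu->pending; i++)
		rcu->cb[i](rcu->arg[i]);
	rcu->pending = 0;
}

/* Returns how many callbacks would still point into the unloaded module. */
static size_t model_module_exit(struct model_rcu *rcu, int use_barrier)
{
	model_synchronize(rcu);
	if (use_barrier)
		model_barrier(rcu);
	return rcu->pending;
}

/* Example callback: count how often it ran. */
static void model_count_cb(void *arg)
{
	++*(int *)arg;
}
```

A non-zero return from model_module_exit() corresponds to the situation the report worries about: a callback that is still queued after the module text has been freed.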
* [PATCH nf-next 4/4] netfilter: ip_vs_nfct: replace u_int8_t with u8
From: Carlos Grillet @ 2026-06-24 18:40 UTC (permalink / raw)
To: David Ahern, Ido Schimmel, Simon Horman, Julian Anastasov,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Pablo Neira Ayuso, Florian Westphal, Phil Sutter
Cc: netdev, lvs-devel, linux-kernel, netfilter-devel, coreteam
In-Reply-To: <20260624184036.71051-1-carlos@carlosgrillet.me>
Use the preferred kernel integer type u8 instead of the BSD-style u_int8_t
variant and update header to match definition.
No functional change.
Signed-off-by: Carlos Grillet <carlos@carlosgrillet.me>
---
include/net/ip_vs.h | 2 +-
net/netfilter/ipvs/ip_vs_nfct.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 49297fec448a..ed2e9bc1bb4e 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -2123,7 +2123,7 @@ void ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp,
int outin);
int ip_vs_confirm_conntrack(struct sk_buff *skb);
void ip_vs_nfct_expect_related(struct sk_buff *skb, struct nf_conn *ct,
- struct ip_vs_conn *cp, u_int8_t proto,
+ struct ip_vs_conn *cp, u8 proto,
const __be16 port, int from_rs);
void ip_vs_conn_drop_conntrack(struct ip_vs_conn *cp);
diff --git a/net/netfilter/ipvs/ip_vs_nfct.c b/net/netfilter/ipvs/ip_vs_nfct.c
index 81974f69e5bb..347185fd0c8c 100644
--- a/net/netfilter/ipvs/ip_vs_nfct.c
+++ b/net/netfilter/ipvs/ip_vs_nfct.c
@@ -208,7 +208,7 @@ static void ip_vs_nfct_expect_callback(struct nf_conn *ct,
* Use port 0 to expect connection from any port.
*/
void ip_vs_nfct_expect_related(struct sk_buff *skb, struct nf_conn *ct,
- struct ip_vs_conn *cp, u_int8_t proto,
+ struct ip_vs_conn *cp, u8 proto,
const __be16 port, int from_rs)
{
struct nf_conntrack_expect *exp;
--
2.54.0
* [PATCH nf-next 2/4] netfilter: nf_conntrack_h323_main: replace u_int8_t with u8
From: Carlos Grillet @ 2026-06-24 18:40 UTC (permalink / raw)
To: Pablo Neira Ayuso, Florian Westphal, Phil Sutter, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: netfilter-devel, coreteam, netdev, linux-kernel
In-Reply-To: <20260624184036.71051-1-carlos@carlosgrillet.me>
Use the preferred kernel integer type u8 instead of the BSD-style u_int8_t
variant.
No functional change.
Signed-off-by: Carlos Grillet <carlos@carlosgrillet.me>
---
net/netfilter/nf_conntrack_h323_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 7f189dceb3c4..68ecaf0daf95 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -671,7 +671,7 @@ static int expect_h245(struct sk_buff *skb, struct nf_conn *ct,
static int callforward_do_filter(struct net *net,
const union nf_inet_addr *src,
const union nf_inet_addr *dst,
- u_int8_t family)
+ u8 family)
{
int ret = 0;
--
2.54.0
* [PATCH nf-next 3/4] netfilter: nf_conntrack_amanda: replace u_int16_t with u16
From: Carlos Grillet @ 2026-06-24 18:40 UTC (permalink / raw)
To: Pablo Neira Ayuso, Florian Westphal, Phil Sutter, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: netfilter-devel, coreteam, netdev, linux-kernel
In-Reply-To: <20260624184036.71051-1-carlos@carlosgrillet.me>
Use the preferred kernel integer type u16 instead of the BSD-style u_int16_t
variant.
No functional change.
Signed-off-by: Carlos Grillet <carlos@carlosgrillet.me>
---
net/netfilter/nf_conntrack_amanda.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/nf_conntrack_amanda.c b/net/netfilter/nf_conntrack_amanda.c
index ddafbdfc96dc..f10ac2c49f4b 100644
--- a/net/netfilter/nf_conntrack_amanda.c
+++ b/net/netfilter/nf_conntrack_amanda.c
@@ -89,7 +89,7 @@ static int amanda_help(struct sk_buff *skb,
struct nf_conntrack_tuple *tuple;
unsigned int dataoff, start, stop, off, i;
char pbuf[sizeof("65535")], *tmp;
- u_int16_t len;
+ u16 len;
__be16 port;
int ret = NF_ACCEPT;
nf_nat_amanda_hook_fn *nf_nat_amanda;
--
2.54.0
* [PATCH nf-next 1/4] netfilter: nf_conntrack_sane: replace u_int16_t with u16
From: Carlos Grillet @ 2026-06-24 18:40 UTC (permalink / raw)
To: Pablo Neira Ayuso, Florian Westphal, Phil Sutter, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: netfilter-devel, coreteam, netdev, linux-kernel
In-Reply-To: <20260624184036.71051-1-carlos@carlosgrillet.me>
Use the preferred kernel integer type u16 instead of the BSD-style u_int16_t
variant.
No functional change.
Signed-off-by: Carlos Grillet <carlos@carlosgrillet.me>
---
net/netfilter/nf_conntrack_sane.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/nf_conntrack_sane.c b/net/netfilter/nf_conntrack_sane.c
index 39085acf7a71..130b3e68090e 100644
--- a/net/netfilter/nf_conntrack_sane.c
+++ b/net/netfilter/nf_conntrack_sane.c
@@ -35,7 +35,7 @@ MODULE_DESCRIPTION("SANE connection tracking helper");
MODULE_ALIAS_NFCT_HELPER(HELPER_NAME);
#define MAX_PORTS 8
-static u_int16_t ports[MAX_PORTS];
+static u16 ports[MAX_PORTS];
static unsigned int ports_c;
module_param_array(ports, ushort, &ports_c, 0400);
--
2.54.0
* [PATCH nf-next 0/4] netfilter: replace u_int*_t with kernel int types (batch 2)
From: Carlos Grillet @ 2026-06-24 18:40 UTC (permalink / raw)
To: Simon Horman, Julian Anastasov, David Ahern, Ido Schimmel,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Pablo Neira Ayuso, Florian Westphal, Phil Sutter
Cc: netdev, lvs-devel, linux-kernel, netfilter-devel, coreteam
This patch series replaces the BSD-style u_int8_t/u_int16_t with the preferred
kernel types u8/u16 across several netfilter files and updates the
corresponding header definitions.
This continues the work started in:
https://lore.kernel.org/all/20260616182948.96865-1-carlos@carlosgrillet.me
No functional changes.
Carlos Grillet (4):
netfilter: nf_conntrack_sane: replace u_int16_t with u16
netfilter: nf_conntrack_h323_main: replace u_int8_t with u8
netfilter: nf_conntrack_amanda: replace u_int16_t with u16
netfilter: ip_vs_nfct: replace u_int8_t with u8
include/net/ip_vs.h | 2 +-
net/netfilter/ipvs/ip_vs_nfct.c | 2 +-
net/netfilter/nf_conntrack_amanda.c | 2 +-
net/netfilter/nf_conntrack_h323_main.c | 2 +-
net/netfilter/nf_conntrack_sane.c | 2 +-
5 files changed, 5 insertions(+), 5 deletions(-)
--
2.54.0
^ permalink raw reply
* Re: [PATCH v2 net 1/2] tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
From: Eric Dumazet @ 2026-06-24 18:49 UTC (permalink / raw)
To: Xin Long
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Jon Maloy, tipc-discussion, netdev,
eric.dumazet, syzbot+e14bc5d4942756023b77
In-Reply-To: <CADvbK_cnZmZkzCUxGEi=uaBug3VcfUd4MiAzQp1OGUsnvau=xA@mail.gmail.com>
On Wed, Jun 24, 2026 at 11:37 AM Xin Long <lucien.xin@gmail.com> wrote:
>
> > diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
> > index 988b8a7f953ad..66f3cb87a0aaa 100644
> > --- a/net/tipc/udp_media.c
> > +++ b/net/tipc/udp_media.c
> > @@ -803,6 +803,14 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b,
> > return err;
> > }
> >
> > +static void rcast_free_rcu(struct rcu_head *rcu)
> > +{
> > + struct udp_replicast *rcast = container_of(rcu, struct udp_replicast, rcu);
> > +
> > + dst_cache_destroy(&rcast->dst_cache);
> > + kfree(rcast);
> > +}
> > +
> Since this adds a module-specific callback rcast_free_rcu registered with RCU
> via call_rcu_hurry(), is an rcu_barrier() needed in the TIPC module exit
> function?
There is one already; this was part of my feedback on this patch:
commit 1579342d71133da7f00daa02c75cebec7372097b
Author: Weiming Shi <bestswngs@gmail.com>
Date: Wed Jun 17 21:57:45 2026 +0800
tipc: fix use-after-free of the discoverer in tipc_disc_rcv()
> If the module is unloaded, the RCU grace period might expire after the module
> memory is freed.
> net/tipc/core.c:tipc_exit() {
> tipc_netlink_compat_stop();
> ...
> pr_info("Deactivated\n");
> }
> Could this result in a kernel panic when RCU attempts to execute the unloaded
> rcast_free_rcu function?
>
> This sashiko report looks legit.
>
> I think synchronize_net() doesn't guarantee rcast_free_rcu() to be done.
^ permalink raw reply
* [PATCH net] net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
From: Jakub Kicinski @ 2026-06-24 19:04 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
Breno Leitao, joshwash, hramamurthy, anthony.l.nguyen,
przemyslaw.kitszel, saeedm, tariqt, mbloch, leon, alexanderduyck,
kernel-team, kys, haiyangz, wei.liu, decui, longli, jordanrhee,
jacob.e.keller, nktgrg, debarghyak, mohsin.bashr, ernis, sdf, gal,
linux-rdma, linux-hyperv
Breno reports the following splats on mlx5:
RTNL: assertion failed at net/core/dev.c (2241)
WARNING: net/core/dev.c:2241 at netif_state_change+0xed/0x130, CPU#5: ethtool/1335
RIP: 0010:netif_state_change+0xf9/0x130
Call Trace:
<TASK>
__linkwatch_sync_dev+0xea/0x120
ethtool_op_get_link+0xe/0x20
__ethtool_get_link+0x26/0x40
linkstate_prepare_data+0x51/0x200
ethnl_default_doit+0x213/0x470
genl_family_rcv_msg_doit+0xdd/0x110
Looks like I missed ethtool_op_get_link() trying to sync linkwatch,
which needs rtnl_lock. Not all drivers do this - bnxt doesn't,
it just returns the link state, so add an opt-in bit.
Reported-by: Breno Leitao <leitao@debian.org>
Fixes: 45079e00133e ("net: ethtool: optionally skip rtnl_lock on Netlink path for GET ops")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: joshwash@google.com
CC: hramamurthy@google.com
CC: anthony.l.nguyen@intel.com
CC: przemyslaw.kitszel@intel.com
CC: saeedm@nvidia.com
CC: tariqt@nvidia.com
CC: mbloch@nvidia.com
CC: leon@kernel.org
CC: alexanderduyck@fb.com
CC: kernel-team@meta.com
CC: kys@microsoft.com
CC: haiyangz@microsoft.com
CC: wei.liu@kernel.org
CC: decui@microsoft.com
CC: longli@microsoft.com
CC: jordanrhee@google.com
CC: jacob.e.keller@intel.com
CC: nktgrg@google.com
CC: debarghyak@google.com
CC: leitao@debian.org
CC: mohsin.bashr@gmail.com
CC: ernis@linux.microsoft.com
CC: sdf@fomichev.me
CC: gal@nvidia.com
CC: linux-rdma@vger.kernel.org
CC: linux-hyperv@vger.kernel.org
---
include/linux/ethtool.h | 2 ++
net/ethtool/common.h | 4 ++++
drivers/net/ethernet/google/gve/gve_ethtool.c | 3 ++-
drivers/net/ethernet/intel/iavf/iavf_ethtool.c | 1 +
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 3 ++-
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 3 ++-
drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c | 4 +++-
drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c | 3 ++-
drivers/net/ethernet/microsoft/mana/mana_ethtool.c | 3 ++-
9 files changed, 20 insertions(+), 6 deletions(-)
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 1b834e2a522e..5d491a98265e 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -942,6 +942,7 @@ struct kernel_ethtool_ts_info {
#define ETHTOOL_OP_NEEDS_RTNL_GPAUSEPARAM BIT(5)
#define ETHTOOL_OP_NEEDS_RTNL_SPAUSEPARAM BIT(6)
#define ETHTOOL_OP_NEEDS_RTNL_RSS BIT(7)
+#define ETHTOOL_OP_NEEDS_RTNL_GLINK BIT(8)
/**
* struct ethtool_ops - optional netdev operations
@@ -978,6 +979,7 @@ struct kernel_ethtool_ts_info {
* - phylink helpers (note that phydev is currently unsupported!)
* - netdev_update_features()
* - netif_set_real_num_tx_queues()
+ * - ethtool_op_get_link() (syncs link watch under rtnl_lock)
*
* @get_drvinfo: Report driver/device information. Modern drivers no
* longer have to implement this callback. Most fields are
diff --git a/net/ethtool/common.h b/net/ethtool/common.h
index 2b3847f00801..4e5356e26f40 100644
--- a/net/ethtool/common.h
+++ b/net/ethtool/common.h
@@ -113,6 +113,8 @@ ethtool_nl_msg_needs_rtnl(const struct net_device *dev, u8 cmd)
return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_SPAUSEPARAM;
case ETHTOOL_MSG_RSS_SET:
return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_RSS;
+ case ETHTOOL_MSG_LINKSTATE_GET:
+ return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_GLINK;
case ETHTOOL_MSG_TSCONFIG_GET:
case ETHTOOL_MSG_TSCONFIG_SET:
/* tsconfig calls ndos (ndo_hwtstamp_set/get), not ethtool ops.
@@ -159,6 +161,8 @@ ethtool_ioctl_needs_rtnl(const struct net_device *dev, u32 ethcmd)
case ETHTOOL_SRXFH:
case ETHTOOL_SRXFHINDIR:
return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_RSS;
+ case ETHTOOL_GLINK:
+ return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_GLINK;
}
return false;
}
diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
index 7cc22916852f..8199738ba979 100644
--- a/drivers/net/ethernet/google/gve/gve_ethtool.c
+++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
@@ -984,7 +984,8 @@ const struct ethtool_ops gve_ethtool_ops = {
.supported_ring_params = ETHTOOL_RING_USE_TCP_DATA_SPLIT |
ETHTOOL_RING_USE_RX_BUF_LEN,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = gve_get_drvinfo,
.get_strings = gve_get_strings,
.get_sset_count = gve_get_sset_count,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
index a615d599b88e..e7cf12eaa268 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
@@ -1855,6 +1855,7 @@ static const struct ethtool_ops iavf_ethtool_ops = {
.supported_coalesce_params = ETHTOOL_COALESCE_USECS |
ETHTOOL_COALESCE_USE_ADAPTIVE,
.supported_input_xfrm = RXH_XFRM_SYM_XOR,
+ .op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = iavf_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_ringparam = iavf_get_ringparam,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 2f5b626ba33f..112926d07634 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -2721,7 +2721,8 @@ const struct ethtool_ops mlx5e_ethtool_ops = {
.rxfh_max_num_contexts = MLX5E_MAX_NUM_RSS,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
- ETHTOOL_OP_NEEDS_RTNL_SPFLAGS,
+ ETHTOOL_OP_NEEDS_RTNL_SPFLAGS |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.supported_coalesce_params = ETHTOOL_COALESCE_USECS |
ETHTOOL_COALESCE_MAX_FRAMES |
ETHTOOL_COALESCE_USE_ADAPTIVE |
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 1a8a19f980d3..c8b76d301c92 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -419,7 +419,8 @@ static const struct ethtool_ops mlx5e_rep_ethtool_ops = {
ETHTOOL_COALESCE_MAX_FRAMES |
ETHTOOL_COALESCE_USE_ADAPTIVE,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = mlx5e_rep_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_strings = mlx5e_rep_get_strings,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c
index 9b3b32408c64..01ddc3def9ac 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c
@@ -286,7 +286,8 @@ const struct ethtool_ops mlx5i_ethtool_ops = {
ETHTOOL_COALESCE_MAX_FRAMES |
ETHTOOL_COALESCE_USE_ADAPTIVE,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = mlx5i_get_drvinfo,
.get_strings = mlx5i_get_strings,
.get_sset_count = mlx5i_get_sset_count,
@@ -309,6 +310,7 @@ const struct ethtool_ops mlx5i_ethtool_ops = {
};
const struct ethtool_ops mlx5i_pkey_ethtool_ops = {
+ .op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = mlx5i_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_ts_info = mlx5i_get_ts_info,
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c b/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c
index cb34fc166ef9..0e47088ec44b 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c
@@ -2024,7 +2024,8 @@ static const struct ethtool_ops fbnic_ethtool_ops = {
ETHTOOL_OP_NEEDS_RTNL_GPAUSEPARAM |
ETHTOOL_OP_NEEDS_RTNL_SPAUSEPARAM |
ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = fbnic_get_drvinfo,
.get_regs_len = fbnic_get_regs_len,
.get_regs = fbnic_get_regs,
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 94e658d07a27..881df597d7f9 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -597,7 +597,8 @@ static int mana_get_link_ksettings(struct net_device *ndev,
const struct ethtool_ops mana_ethtool_ops = {
.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_ethtool_stats = mana_get_ethtool_stats,
.get_sset_count = mana_get_sset_count,
.get_strings = mana_get_strings,
--
2.54.0
^ permalink raw reply related
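The opt-in scheme described in the patch above can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel code: `demo_ops`, `demo_cmd` and `demo_cmd_needs_rtnl()` are hypothetical names, and only the bit positions for the RSS and GLINK flags are taken from the diff. It shows how a per-driver mask decides whether a given command keeps the lock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel flags; bit positions mirror the diff. */
#define OP_NEEDS_RTNL_RSS   (1u << 7)
#define OP_NEEDS_RTNL_GLINK (1u << 8)

/* Hypothetical command identifiers, not the real ETHTOOL_MSG_* values. */
enum demo_cmd {
	DEMO_CMD_RSS_SET,
	DEMO_CMD_LINKSTATE_GET,
	DEMO_CMD_OTHER,
};

/* Hypothetical reduction of struct ethtool_ops to the one field of interest. */
struct demo_ops {
	uint32_t op_needs_rtnl; /* per-driver opt-in mask */
};

/* Return true when the driver has opted in to holding the lock for cmd. */
static bool demo_cmd_needs_rtnl(const struct demo_ops *ops, enum demo_cmd cmd)
{
	switch (cmd) {
	case DEMO_CMD_RSS_SET:
		return ops->op_needs_rtnl & OP_NEEDS_RTNL_RSS;
	case DEMO_CMD_LINKSTATE_GET:
		return ops->op_needs_rtnl & OP_NEEDS_RTNL_GLINK;
	default:
		return false;
	}
}
```

In this model a driver that syncs linkwatch from its get_link callback sets the GLINK bit, while a driver that only returns cached link state leaves it clear and skips the lock.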
* [PATCH] vhost/vdpa: reject overflowing PA map page counts
From: Yousef Alhouseen @ 2026-06-24 19:06 UTC (permalink / raw)
To: Michael S . Tsirkin, Jason Wang, Eugenio Pérez
Cc: kvm, virtualization, netdev, linux-kernel, Yousef Alhouseen
vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
size before computing the number of pages to pin. If that addition wraps,
the code can pin and map fewer pages than the requested IOTLB range.
Reject sizes that overflow the page-count calculation. Also make the
memlock check subtraction-based so a large page count cannot wrap the
pinned page total.
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
drivers/vhost/vdpa.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ac55275fa..090cb8693 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1102,6 +1102,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
unsigned int gup_flags = FOLL_LONGTERM;
unsigned long npages, cur_base, map_pfn, last_pfn = 0;
unsigned long lock_limit, sz2pin, nchunks, i;
+ unsigned long page_offset;
+ u64 pinned_vm;
u64 start = iova;
long pinned;
int ret = 0;
@@ -1114,7 +1116,12 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
if (perm & VHOST_ACCESS_WO)
gup_flags |= FOLL_WRITE;
- npages = PFN_UP(size + (iova & ~PAGE_MASK));
+ page_offset = iova & ~PAGE_MASK;
+ if (size > ULONG_MAX - page_offset) {
+ ret = -EINVAL;
+ goto free;
+ }
+ npages = PFN_UP(size + page_offset);
if (!npages) {
ret = -EINVAL;
goto free;
@@ -1123,7 +1130,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
mmap_read_lock(dev->mm);
lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
- if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
+ pinned_vm = atomic64_read(&dev->mm->pinned_vm);
+ if (npages > lock_limit || pinned_vm > lock_limit - npages) {
ret = -ENOMEM;
goto unlock;
}
--
2.54.0
^ permalink raw reply related
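The two checks in the patch above follow a common pattern: test against the limit by subtraction before adding. Below is a minimal user-space sketch of that pattern, assuming a 4 KiB page; `demo_npages()` and `demo_fits_memlock()` are hypothetical helpers, not kernel functions. Unlike the kernel's PFN_UP(), the sketch rounds up with a division and a remainder test so that the rounding step itself cannot wrap.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for the sketch; the kernel uses PAGE_SHIFT/PAGE_SIZE. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/*
 * Compute the number of pages covering size bytes starting at page_offset.
 * Returns false, leaving *npages untouched, if size + page_offset would not
 * fit in an unsigned long.
 */
static bool demo_npages(uint64_t size, unsigned long page_offset,
			unsigned long *npages)
{
	unsigned long total;

	if (size > ULONG_MAX - page_offset)
		return false;

	total = (unsigned long)size + page_offset;
	/* Round up without adding DEMO_PAGE_SIZE - 1, which could wrap. */
	*npages = total / DEMO_PAGE_SIZE + (total % DEMO_PAGE_SIZE != 0);
	return true;
}

/*
 * Return true when npages additional pages fit under lock_limit given
 * pinned pages already accounted. No sum is ever formed, so nothing wraps.
 */
static bool demo_fits_memlock(unsigned long npages, uint64_t pinned,
			      unsigned long lock_limit)
{
	if (npages > lock_limit)
		return false;
	return pinned <= lock_limit - npages;
}
```

Both helpers compare against the remaining headroom (`ULONG_MAX - page_offset`, `lock_limit - npages`) instead of comparing a possibly wrapped sum against the limit.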
* Re: [PATCH v2 net 0/2] tipc: syzbot related fixes
From: Xin Long @ 2026-06-24 19:07 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Jon Maloy, tipc-discussion, netdev,
eric.dumazet
In-Reply-To: <20260623173030.2925059-1-edumazet@google.com>
On Tue, Jun 23, 2026 at 1:30 PM Eric Dumazet <edumazet@google.com> wrote:
>
> First patch fixes a recent syzbot report.
>
> Second patch is inspired by numerous syzbot soft lockup
> reports with RTNL pressure.
>
> Eric Dumazet (2):
> tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
> tipc: avoid busy looping in tipc_exit_net()
>
> net/tipc/core.c | 4 ++--
> net/tipc/udp_media.c | 19 ++++++++++++++-----
> 2 files changed, 16 insertions(+), 7 deletions(-)
>
> --
> 2.55.0.rc0.799.gd6f94ed593-goog
>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
^ permalink raw reply
* Re: [PATCH] vhost/vdpa: reject overflowing PA map page counts
From: Michael S. Tsirkin @ 2026-06-24 19:53 UTC (permalink / raw)
To: Yousef Alhouseen
Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
linux-kernel
In-Reply-To: <20260624190653.2893-1-alhouseenyousef@gmail.com>
On Wed, Jun 24, 2026 at 09:06:53PM +0200, Yousef Alhouseen wrote:
> vhost_vdpa_pa_map() adds the IOVA page offset to the user-controlled map
> size before computing the number of pages to pin. If that addition wraps,
> the code can pin and map fewer pages than the requested IOTLB range.
>
> Reject sizes that overflow the page-count calculation.
You should add "on 32-bit systems" - I do not see how it can
overflow on 64-bit.
> Also make the
> memlock check subtraction-based so a large page count cannot wrap the
> pinned page total.
I don't see how this can happen at all - pinned_vm is in units of pages.
> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
> drivers/vhost/vdpa.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ac55275fa..090cb8693 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1102,6 +1102,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> unsigned int gup_flags = FOLL_LONGTERM;
> unsigned long npages, cur_base, map_pfn, last_pfn = 0;
> unsigned long lock_limit, sz2pin, nchunks, i;
> + unsigned long page_offset;
> + u64 pinned_vm;
> u64 start = iova;
> long pinned;
> int ret = 0;
> @@ -1114,7 +1116,12 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> if (perm & VHOST_ACCESS_WO)
> gup_flags |= FOLL_WRITE;
>
> - npages = PFN_UP(size + (iova & ~PAGE_MASK));
> + page_offset = iova & ~PAGE_MASK;
> + if (size > ULONG_MAX - page_offset) {
> + ret = -EINVAL;
> + goto free;
> + }
> + npages = PFN_UP(size + page_offset);
> if (!npages) {
> ret = -EINVAL;
> goto free;
> @@ -1123,7 +1130,8 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> mmap_read_lock(dev->mm);
>
> lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
> - if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> + pinned_vm = atomic64_read(&dev->mm->pinned_vm);
> + if (npages > lock_limit || pinned_vm > lock_limit - npages) {
> ret = -ENOMEM;
> goto unlock;
> }
> --
> 2.54.0
^ permalink raw reply
* Re: [PATCH net] net: enetc: fix potential divide-by-zero when num_vsi is zero
From: Maxime Chevallier @ 2026-06-24 19:54 UTC (permalink / raw)
To: wei.fang, claudiu.manoil, vladimir.oltean, xiaoning.wang,
andrew+netdev, davem, edumazet, kuba, pabeni
Cc: Frank.Li, wei.fang, imx, netdev, linux-kernel
In-Reply-To: <20260624072726.1238903-1-wei.fang@oss.nxp.com>
Hi,
On 6/24/26 09:27, wei.fang@oss.nxp.com wrote:
> From: Wei Fang <wei.fang@nxp.com>
>
> For i.MX94 series, all the standalone ENETCs do not support SR-IOV, so
> pf->caps.num_vsi is zero. This leads to a divide-by-zero in
> enetc4_default_rings_allocation() when distributing rings among PF and
> VFs.
>
> Division by zero is undefined behavior in C. On ARM64, the UDIV/SDIV
> instructions silently return zero rather than raising an exception, so
> the issue does not cause a visible crash. However, relying on this
> behavior is incorrect and poses a cross-platform compatibility risk.
>
> Add an explicit check for num_vsi == 0 and return early after the PF's
> rings have been configured.
>
> Fixes: 2d673b0e2f8d ("net: enetc: add standalone ENETC support for i.MX94")
> Signed-off-by: Wei Fang <wei.fang@nxp.com>
> ---
> drivers/net/ethernet/freescale/enetc/enetc4_pf.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/freescale/enetc/enetc4_pf.c b/drivers/net/ethernet/freescale/enetc/enetc4_pf.c
> index 4e771f852358..437a15bbb47b 100644
> --- a/drivers/net/ethernet/freescale/enetc/enetc4_pf.c
> +++ b/drivers/net/ethernet/freescale/enetc/enetc4_pf.c
> @@ -322,6 +322,9 @@ static void enetc4_default_rings_allocation(struct enetc_pf *pf)
> val = enetc4_psicfgr0_val_construct(false, num_tx_bdr, num_rx_bdr);
> enetc_port_wr(hw, ENETC4_PSICFGR0(0), val);
>
> + if (!pf->caps.num_vsi)
> + return;
> +
> num_rx_bdr = pf->caps.num_rx_bdr - num_rx_bdr;
> rx_rem = num_rx_bdr % pf->caps.num_vsi;
> num_rx_bdr = num_rx_bdr / pf->caps.num_vsi;
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Maxime
^ permalink raw reply
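The early return in the patch above guards both the modulo and the division that follow it. A minimal user-space sketch of the same guard, assuming a simplified ring-distribution model: `demo_ring_split` and `demo_split_rings()` are hypothetical names, and clamping when the PF already holds more rings than exist is an assumption of the sketch, not something the driver does.

```c
/* Hypothetical result type: rings per VSI and the leftover rings. */
struct demo_ring_split {
	unsigned int per_vsi;
	unsigned int remainder;
};

/*
 * Distribute the rings left over after the PF took pf_rings among num_vsi
 * virtual station interfaces. With num_vsi == 0 there is nothing to
 * distribute, so return before any division is evaluated.
 */
static struct demo_ring_split demo_split_rings(unsigned int total,
					       unsigned int pf_rings,
					       unsigned int num_vsi)
{
	struct demo_ring_split split = { 0, 0 };
	unsigned int left;

	if (!num_vsi)
		return split;

	/* Assumption of this sketch: clamp instead of underflowing. */
	left = total > pf_rings ? total - pf_rings : 0;
	split.per_vsi = left / num_vsi;
	split.remainder = left % num_vsi;
	return split;
}
```

Checking the divisor once, before either `/` or `%`, keeps the behaviour defined on every architecture instead of depending on what the divide instruction does with a zero operand.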
* Re: [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Willem de Bruijn @ 2026-06-24 20:01 UTC (permalink / raw)
To: Jakub Sitnicki, Michal Luczaj, Willem de Bruijn
Cc: John Fastabend, Jiayuan Chen, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Alexei Starovoitov,
Cong Wang, Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Martin KaFai Lau, Song Liu,
Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan, netdev,
bpf, linux-kernel, linux-kselftest, kuniyu
In-Reply-To: <87wlvoxdq1.fsf@cloudflare.com>
Jakub Sitnicki wrote:
> On Tue, Jun 23, 2026 at 08:03 PM +02, Michal Luczaj wrote:
> > UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
> > sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
> >
> > Because sockmap accepts unbound UDP sockets, a BPF program can increment a
> > socket's refcount via lookup. If the socket is subsequently bound, the
> > transition from unbound to bound causes bpf_sk_release() to skip the
> > decrement of the refcount, causing a memory leak.
> >
> > unreferenced object 0xffff88810bc2eb40 (size 1984):
> > comm "test_progs", pid 2451, jiffies 4295320596
> > hex dump (first 32 bytes):
> > 7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00 ................
> > 02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
> > backtrace (crc bdee079d):
> > kmem_cache_alloc_noprof+0x557/0x660
> > sk_prot_alloc+0x69/0x240
> > sk_alloc+0x30/0x460
> > inet_create+0x2ce/0xf80
> > __sock_create+0x25b/0x5c0
> > __sys_socket+0x119/0x1d0
> > __x64_sys_socket+0x72/0xd0
> > do_syscall_64+0xa1/0x5f0
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > Maintain balanced refcounts across sk lookup/release: (re-)set
> > SOCK_RCU_FREE on proto update to treat the socket (whether bound or
> > unbound) as not requiring a refcount increment on (an RCU-protected) lookup.
> >
> > Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
> > Signed-off-by: Michal Luczaj <mhal@rbox.co>
> > ---
> > Note: this issue is related to commit 67312adc96b5 ("bpf: reject unhashed
> > sockets in bpf_sk_assign").
> > ---
> > net/ipv4/udp_bpf.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> > index ad57c4c9eaab..970327b59582 100644
> > --- a/net/ipv4/udp_bpf.c
> > +++ b/net/ipv4/udp_bpf.c
> > @@ -173,6 +173,9 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
> > if (sk->sk_family == AF_INET6)
> > udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
> >
> > + /* Treat all sockets as non-refcounted, regardless of binding state. */
> > + sock_set_flag(sk, SOCK_RCU_FREE);
> > +
> > sock_replace_proto(sk, &udp_bpf_prots[family]);
> > return 0;
> > }
>
> There is a side effect that an unhashed (unbound) UDP socket can now be
> selected in sk_lookup with bpf_sk_assign.
The commit does mention a related fix, beneath the ---, commit
67312adc96b5 ("bpf: reject unhashed sockets in bpf_sk_assign").
That fixes a similar issue by exactly disallowing this:
Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
This matches the behaviour of __inet_lookup_skb which is ultimately
the goal of bpf_sk_assign().
So ..
> Though perhaps that's for the
> better because TC bpf_sk_assign doesn't reject non-refcounted UDP
> sockets either, so we would have both socket dispatch sites behave the
> same way.
.. there are two conflicting types of consistency here? Consistent with
__inet_lookup_skb or the TC bpf hook. Of those the first is the more
canonical.
> Also, with this patch, if we insert & remove an unhashed UDP socket
> into/from a sockmap, we end up with an unhashed non-refcounted UDP
> socket. Not entirely sure if that is actually a problem or not.
>
> Willem, what is your take on having unhashed non-refcounted UDP sockets?
I don't immediately see a problem, but I'm not an expert on SOCK_RCU_FREE.
^ permalink raw reply
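The imbalance described in the quoted commit message can be reproduced with a toy model. The sketch below is user-space C, not kernel code: `demo_sock` and the `demo_*()` helpers are hypothetical, and the model assumes only what the commit message states, namely that lookup and release each consult the flag at the time they run and that binding sets it.

```c
#include <stdbool.h>

/* Hypothetical model of a socket: a reference count and one flag. */
struct demo_sock {
	int refcnt;
	bool rcu_free; /* models SOCK_RCU_FREE */
};

/* A socket is treated as refcounted only while the flag is clear. */
static bool demo_is_refcounted(const struct demo_sock *sk)
{
	return !sk->rcu_free;
}

/* Lookup takes a reference only on refcounted sockets. */
static void demo_lookup(struct demo_sock *sk)
{
	if (demo_is_refcounted(sk))
		sk->refcnt++;
}

/* Release drops a reference only on refcounted sockets. */
static void demo_release(struct demo_sock *sk)
{
	if (demo_is_refcounted(sk))
		sk->refcnt--;
}

/* Binding flips the socket to the RCU-freed, non-refcounted mode. */
static void demo_bind(struct demo_sock *sk)
{
	sk->rcu_free = true;
}

/* Models the proposed fix: set the flag when the socket enters the map. */
static void demo_update_proto(struct demo_sock *sk)
{
	sk->rcu_free = true;
}
```

Running lookup, bind, release on a socket that starts unbound leaves the count one too high in this model, because release sees the flag and skips the decrement. Calling `demo_update_proto()` first makes lookup and release agree, which is the balance the patch aims for.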