* Re: [PATCH net v2 4/4] net: mana: Fix EQ leak in mana_remove on NULL port
From: Simon Horman @ 2026-04-14 15:40 UTC (permalink / raw)
To: ernis
Cc: Simon Horman, pabeni, edumazet, ssengar, dipayanroy, gargaditya,
decui, shacharr, linux-hyperv, kotaranov, kys, shirazsaleem,
longli, kuba, andrew+netdev, wei.liu, linux-kernel, netdev,
haiyangz, kees, stephen, leon, davem
In-Reply-To: <20260413050843.605789-5-ernis@linux.microsoft.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
net: mana: Fix EQ leak in mana_remove on NULL port
In mana_remove(), when a NULL port is encountered in the port iteration
loop, 'goto out' skips the mana_destroy_eq(ac) call, leaking the event
queues allocated earlier by mana_create_eq(). This can happen when
mana_probe_port() fails for port 0, leaving ac->ports[0] as NULL. On
driver unload or error cleanup, mana_remove() hits the NULL entry and
jumps past mana_destroy_eq(). Change 'goto out' to 'break' so the
for-loop exits normally and mana_destroy_eq() is always reached.
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Does this Fixes tag point to the correct commit? Looking at the commit
history, the original driver commit ca9c54d2d6a5 did not have
mana_destroy_eq() in mana_remove() at all, so the 'goto out' pattern could
not have caused an EQ leak at that point.
The bug appears to have been introduced by commit 1e2d0824a9c3 ("net: mana:
Add support for EQ sharing"), which added mana_destroy_eq(ac) to
mana_remove() and placed it before the 'out:' label, so the 'goto out'
taken on a NULL port jumps past it.
Should the Fixes tag be:
Fixes: 1e2d0824a9c3 ("net: mana: Add support for EQ sharing")
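For anyone reproducing this locally, the control-flow problem from the
commit message can be sketched as a small userspace model. Everything below
(struct fake_ac, destroy_eq(), the two remove variants) is hypothetical and
only mirrors the shape described above; it is not the driver code.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PORTS 4

/* Hypothetical stand-in for the adapter context; not the real struct. */
struct fake_ac {
	int num_ports;
	void *ports[MAX_PORTS];
	bool eq_allocated;
};

/* Stand-in for mana_destroy_eq(): releases the event queues. */
static void destroy_eq(struct fake_ac *ac)
{
	ac->eq_allocated = false;
}

/* Old shape: a NULL port jumps to 'out' and skips destroy_eq(). */
static void remove_with_goto(struct fake_ac *ac)
{
	int i;

	for (i = 0; i < ac->num_ports; i++) {
		if (!ac->ports[i])
			goto out;
		ac->ports[i] = NULL;	/* per-port teardown would go here */
	}
	destroy_eq(ac);
out:
	return;
}

/* Fixed shape: 'break' leaves the loop and falls through to destroy_eq(). */
static void remove_with_break(struct fake_ac *ac)
{
	int i;

	for (i = 0; i < ac->num_ports; i++) {
		if (!ac->ports[i])
			break;
		ac->ports[i] = NULL;	/* per-port teardown would go here */
	}
	destroy_eq(ac);
}
```

With ports[0] == NULL, the goto variant returns with eq_allocated still
set, while the break variant clears it.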
* Re: [PATCH v2] wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
From: Jason A. Donenfeld @ 2026-04-14 15:40 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Shardul Bankar, kuniyu, andrew+netdev, davem, edumazet, pabeni,
wireguard, netdev, linux-kernel, janak, kalpan.jani, shardulsb08,
syzbot+f2fbf7478a35a94c8b7c
In-Reply-To: <20260414081824.0edf6113@kernel.org>
On Tue, Apr 14, 2026 at 5:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 14 Apr 2026 15:28:37 +0200 Jason A. Donenfeld wrote:
> > Thanks. Applied to the wireguard tree, and also added the missing
> > __net_exit and __read_mostly annotations in the process.
>
> Hi Jason, while we have you - do you have a PR for us for wireguard?
> We're going to be sending the net-next PR later today..
Sent!
* [PATCH net-next 4/4] wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
From: Jason A. Donenfeld @ 2026-04-14 15:39 UTC (permalink / raw)
To: netdev, kuba, pabeni
Cc: Shardul Bankar, syzbot+f2fbf7478a35a94c8b7c, stable,
Jason A. Donenfeld
In-Reply-To: <20260414153944.2742252-1-Jason@zx2c4.com>
From: Shardul Bankar <shardul.b@mpiricsoftware.com>
wg_netns_pre_exit() manually acquires rtnl_lock() inside the
pernet .pre_exit callback. This causes a hung task when another
thread holds rtnl_mutex - the cleanup_net workqueue (or the
setup_net failure rollback path) blocks indefinitely in
wg_netns_pre_exit() waiting to acquire the lock.
Convert to .exit_rtnl, introduced in commit 7a60d91c690b ("net:
Add ->exit_rtnl() hook to struct pernet_operations."), where the
framework already holds RTNL and batches all callbacks under a
single rtnl_lock()/rtnl_unlock() pair, eliminating the contention
window.
The rcu_assign_pointer(wg->creating_net, NULL) is safe to move
from .pre_exit to .exit_rtnl (which runs after synchronize_rcu())
because all RCU readers of creating_net either use maybe_get_net()
- which returns NULL for a dying namespace with zero refcount - or
access net->user_ns which remains valid throughout the entire
ops_undo_list sequence.
Reported-by: syzbot+f2fbf7478a35a94c8b7c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=cb64c22a492202ca929e18262fdb8cb89e635c70
Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com>
[ Jason: added __net_exit and __read_mostly annotations that were missing. ]
Fixes: 900575aa33a3 ("wireguard: device: avoid circular netns references")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
drivers/net/wireguard/device.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
index 46a71ec36af87..67b07ee2d6600 100644
--- a/drivers/net/wireguard/device.c
+++ b/drivers/net/wireguard/device.c
@@ -411,12 +411,11 @@ static struct rtnl_link_ops link_ops __read_mostly = {
.newlink = wg_newlink,
};
-static void wg_netns_pre_exit(struct net *net)
+static void __net_exit wg_netns_exit_rtnl(struct net *net, struct list_head *dev_kill_list)
{
struct wg_device *wg;
struct wg_peer *peer;
- rtnl_lock();
list_for_each_entry(wg, &device_list, device_list) {
if (rcu_access_pointer(wg->creating_net) == net) {
pr_debug("%s: Creating namespace exiting\n", wg->dev->name);
@@ -429,11 +428,10 @@ static void wg_netns_pre_exit(struct net *net)
mutex_unlock(&wg->device_update_lock);
}
}
- rtnl_unlock();
}
-static struct pernet_operations pernet_ops = {
- .pre_exit = wg_netns_pre_exit
+static struct pernet_operations pernet_ops __read_mostly = {
+ .exit_rtnl = wg_netns_exit_rtnl
};
int __init wg_device_init(void)
--
2.53.0
* [PATCH net-next 3/4] wireguard: allowedips: remove redundant space
From: Jason A. Donenfeld @ 2026-04-14 15:39 UTC (permalink / raw)
To: netdev, kuba, pabeni; +Cc: Jason A. Donenfeld
In-Reply-To: <20260414153944.2742252-1-Jason@zx2c4.com>
Not a contentful commit, but amusingly found when porting ba3d7b93 to
Windows.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
drivers/net/wireguard/selftest/allowedips.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/wireguard/selftest/allowedips.c b/drivers/net/wireguard/selftest/allowedips.c
index 2da3008c3a014..3e857e6fb627b 100644
--- a/drivers/net/wireguard/selftest/allowedips.c
+++ b/drivers/net/wireguard/selftest/allowedips.c
@@ -623,7 +623,7 @@ bool __init wg_allowedips_selftest(void)
test_boolean(!remove(6, b, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 128));
test(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef);
/* invalid CIDR should have no effect and return -EINVAL */
- test_boolean(remove(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 129) == -EINVAL);
+ test_boolean(remove(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 129) == -EINVAL);
test(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef);
remove(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 128);
test_negative(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef);
--
2.53.0
* [PATCH net-next 2/4] tools: ynl: add sample for wireguard
From: Jason A. Donenfeld @ 2026-04-14 15:39 UTC (permalink / raw)
To: netdev, kuba, pabeni; +Cc: Asbjørn Sloth Tønnesen, Jason A. Donenfeld
In-Reply-To: <20260414153944.2742252-1-Jason@zx2c4.com>
From: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Add a sample application for WireGuard, using the generated C library.
The main benefit of this is to exercise the generated library,
which might be useful for future self-tests.
Example:
$ make -C tools/net/ynl/lib
$ make -C tools/net/ynl/generated
$ make -C tools/net/ynl/tests wireguard
$ ./tools/net/ynl/tests/wireguard
usage: ./tools/net/ynl/tests/wireguard <ifindex|ifname>
$ sudo ./tools/net/ynl/tests/wireguard wg-test
Interface 3: wg-test
Peer 6adfb183a4a2c94a2f92dab5ade762a4788[...]:
Data: rx: 42 / tx: 42 bytes
Allowed IPs:
0.0.0.0/0
::/0
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
tools/net/ynl/tests/.gitignore | 1 +
tools/net/ynl/tests/wireguard.c | 106 ++++++++++++++++++++++++++++++++
2 files changed, 107 insertions(+)
create mode 100644 tools/net/ynl/tests/wireguard.c
diff --git a/tools/net/ynl/tests/.gitignore b/tools/net/ynl/tests/.gitignore
index 045385df42a45..a7832ebfdbbc3 100644
--- a/tools/net/ynl/tests/.gitignore
+++ b/tools/net/ynl/tests/.gitignore
@@ -7,3 +7,4 @@ rt-link
rt-route
tc
tc-filter-add
+wireguard
diff --git a/tools/net/ynl/tests/wireguard.c b/tools/net/ynl/tests/wireguard.c
new file mode 100644
index 0000000000000..df601e742c287
--- /dev/null
+++ b/tools/net/ynl/tests/wireguard.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <ynl.h>
+
+#include "wireguard-user.h"
+
+static void print_allowed_ip(const struct wireguard_wgallowedip *aip)
+{
+ char addr_out[INET6_ADDRSTRLEN];
+
+ if (!inet_ntop(aip->family, aip->ipaddr, addr_out, sizeof(addr_out))) {
+ addr_out[0] = '?';
+ addr_out[1] = '\0';
+ }
+ printf("\t\t\t%s/%u\n", addr_out, aip->cidr_mask);
+}
+
+/* Only printing public key in this demo. For better key formatting,
+ * use the constant-time implementation as found in wireguard-tools.
+ */
+static void print_peer_header(const struct wireguard_wgpeer *peer)
+{
+ unsigned int len = peer->_len.public_key;
+ uint8_t *key = peer->public_key;
+ unsigned int i;
+
+ if (len != 32)
+ return;
+ printf("\tPeer ");
+ for (i = 0; i < len; i++)
+ printf("%02x", key[i]);
+ printf(":\n");
+}
+
+static void print_peer(const struct wireguard_wgpeer *peer)
+{
+ unsigned int i;
+
+ print_peer_header(peer);
+ printf("\t\tData: rx: %llu / tx: %llu bytes\n",
+ peer->rx_bytes, peer->tx_bytes);
+ printf("\t\tAllowed IPs:\n");
+ for (i = 0; i < peer->_count.allowedips; i++)
+ print_allowed_ip(&peer->allowedips[i]);
+}
+
+static void build_request(struct wireguard_get_device_req *req, char *arg)
+{
+ char *endptr;
+ int ifindex;
+
+ ifindex = strtol(arg, &endptr, 0);
+ if (endptr != arg + strlen(arg) || errno != 0)
+ ifindex = 0;
+ if (ifindex > 0)
+ wireguard_get_device_req_set_ifindex(req, ifindex);
+ else
+ wireguard_get_device_req_set_ifname(req, arg);
+}
+
+int main(int argc, char **argv)
+{
+ struct wireguard_get_device_list *devs;
+ struct wireguard_get_device_req *req;
+ struct ynl_error yerr;
+ struct ynl_sock *ys;
+
+ if (argc < 2) {
+ fprintf(stderr, "usage: %s <ifindex|ifname>\n", argv[0]);
+ return 1;
+ }
+
+ ys = ynl_sock_create(&ynl_wireguard_family, &yerr);
+ if (!ys) {
+ fprintf(stderr, "YNL: %s\n", yerr.msg);
+ return 2;
+ }
+
+ req = wireguard_get_device_req_alloc();
+ build_request(req, argv[1]);
+
+ devs = wireguard_get_device_dump(ys, req);
+ if (!devs) {
+ fprintf(stderr, "YNL (%d): %s\n", ys->err.code, ys->err.msg);
+ wireguard_get_device_req_free(req);
+ ynl_sock_destroy(ys);
+ return 3;
+ }
+
+ ynl_dump_foreach(devs, d) {
+ unsigned int i;
+
+ printf("Interface %d: %s\n", d->ifindex, d->ifname);
+ for (i = 0; i < d->_count.peers; i++)
+ print_peer(&d->peers[i]);
+ }
+
+ wireguard_get_device_list_free(devs);
+ wireguard_get_device_req_free(req);
+ ynl_sock_destroy(ys);
+
+ return 0;
+}
--
2.53.0
* [PATCH net-next 1/4] wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
From: Jason A. Donenfeld @ 2026-04-14 15:39 UTC (permalink / raw)
To: netdev, kuba, pabeni; +Cc: Fushuai Wang, Simon Horman, Jason A. Donenfeld
In-Reply-To: <20260414153944.2742252-1-Jason@zx2c4.com>
From: Fushuai Wang <wangfushuai@baidu.com>
Replace call_rcu() + kmem_cache_free() with kfree_rcu() to simplify
the code and reduce function size.
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
drivers/net/wireguard/allowedips.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c
index 09f7fcd7da78b..5ece9acad64d8 100644
--- a/drivers/net/wireguard/allowedips.c
+++ b/drivers/net/wireguard/allowedips.c
@@ -48,11 +48,6 @@ static void push_rcu(struct allowedips_node **stack,
}
}
-static void node_free_rcu(struct rcu_head *rcu)
-{
- kmem_cache_free(node_cache, container_of(rcu, struct allowedips_node, rcu));
-}
-
static void root_free_rcu(struct rcu_head *rcu)
{
struct allowedips_node *node, *stack[MAX_ALLOWEDIPS_DEPTH] = {
@@ -271,13 +266,13 @@ static void remove_node(struct allowedips_node *node, struct mutex *lock)
if (free_parent)
child = rcu_dereference_protected(parent->bit[!(node->parent_bit_packed & 1)],
lockdep_is_held(lock));
- call_rcu(&node->rcu, node_free_rcu);
+ kfree_rcu(node, rcu);
if (!free_parent)
return;
if (child)
child->parent_bit_packed = parent->parent_bit_packed;
*(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child;
- call_rcu(&parent->rcu, node_free_rcu);
+ kfree_rcu(parent, rcu);
}
static int remove(struct allowedips_node __rcu **trie, u8 bits, const u8 *key,
--
2.53.0
* [PATCH net-next 0/4] WireGuard fixes for 7.1-rc1
From: Jason A. Donenfeld @ 2026-04-14 15:39 UTC (permalink / raw)
To: netdev, kuba, pabeni; +Cc: Jason A. Donenfeld
Hi Jakub,
Please find 4 simple patches attached:
1) Asbjørn's YNL sample, finally merged. Sorry for the wait on this one.
2) A simplification to use kfree_rcu instead of call_rcu, since
kfree_rcu now works with kmem caches.
3) A trivial formatting derp.
4) Fix for a deadlock by moving to using exit_rtnl instead of pre_exit.
Please apply these!
Thanks,
Jason
Asbjørn Sloth Tønnesen (1):
tools: ynl: add sample for wireguard
Fushuai Wang (1):
wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
Jason A. Donenfeld (1):
wireguard: allowedips: remove redundant space in comment
Shardul Bankar (1):
wireguard: device: use exit_rtnl callback instead of manual rtnl_lock
in pre_exit
drivers/net/wireguard/allowedips.c | 9 +-
drivers/net/wireguard/device.c | 8 +-
drivers/net/wireguard/selftest/allowedips.c | 2 +-
tools/net/ynl/tests/.gitignore | 1 +
tools/net/ynl/tests/wireguard.c | 106 ++++++++++++++++++++
5 files changed, 113 insertions(+), 13 deletions(-)
create mode 100644 tools/net/ynl/tests/wireguard.c
--
2.53.0
* Re: [PATCH net-next v3 0/5] net: phy: Fix phy_init_hw() placement and update locking
From: Jakub Kicinski @ 2026-04-14 15:39 UTC (permalink / raw)
To: Andrew Lunn
Cc: Biju, Heiner Kallweit, David S. Miller, Eric Dumazet, Paolo Abeni,
Biju Das, Russell King, netdev, linux-kernel, Geert Uytterhoeven,
Prabhakar Mahadev Lad, linux-renesas-soc
In-Reply-To: <20260412140032.122841-1-biju.das.jz@bp.renesas.com>
On Sun, 12 Apr 2026 15:00:22 +0100 Biju wrote:
> This series fixes two related issues in the PHY subsystem: incorrect
> placement of phy_init_hw() in the resume path, and drop/update locking
> in several PHY drivers.
Hi Andrew, IIUC this should be applied for 7.1 but we're waiting
for Russell (who is AFK/busy) to review. Did I get that right?
* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
From: mkf @ 2026-04-14 15:37 UTC (permalink / raw)
To: Jiayuan Chen, bpf
Cc: Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
David Ahern, netdev, linux-doc, linux-kernel
In-Reply-To: <20260414105702.248310-1-jiayuan.chen@linux.dev>
On Tue, 2026-04-14 at 18:57 +0800, Jiayuan Chen wrote:
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
>
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
>
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
> -> bpf_setsockopt(TCP_NODELAY)
> -> tcp_push_pending_frames()
> -> tcp_current_mss()
> -> tcp_established_options()
> -> bpf_skops_hdr_opt_len()
> /* infinite recursion */
> -> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
>
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
> Documentation/networking/net_cachelines/tcp_sock.rst | 1 +
> include/linux/tcp.h | 11 ++++++++++-
> net/core/filter.c | 4 ++++
> net/ipv4/tcp_minisocks.c | 1 +
> net/ipv4/tcp_output.c | 3 +++
> 5 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst
> b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c..07d3226d90cc 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -152,6 +152,7 @@ unsigned_int keepalive_intvl
> int linger2
> u8 bpf_sock_ops_cb_flags
> u8:1 bpf_chg_cc_inprogress
> +u8:1 bpf_hdr_opt_len_cb_inprogress
> u16 timeout_rehash
> u32 rcv_ooopack
> u32 rcv_rtt_last_tsecr
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23..2bfb73cf922e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -475,12 +475,21 @@ struct tcp_sock {
> u8 bpf_sock_ops_cb_flags; /* Control calling BPF programs
> * values defined in uapi/linux/tcp.h
> */
> - u8 bpf_chg_cc_inprogress:1; /* In the middle of
> + u8 bpf_chg_cc_inprogress:1, /* In the middle of
> * bpf_setsockopt(TCP_CONGESTION),
> * it is to avoid the bpf_tcp_cc->init()
> * to recur itself by calling
> * bpf_setsockopt(TCP_CONGESTION, "itself").
> */
> + bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
> + * callback so that a nested
> + * bpf_setsockopt(TCP_NODELAY) or
> + * bpf_setsockopt(TCP_CORK) cannot
> + * trigger tcp_push_pending_frames(),
> + * which would call tcp_current_mss()
> + * -> bpf_skops_hdr_opt_len(), causing
> + * infinite recursion.
> + */
> #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
> #else
> #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 78b548158fb0..518699429a7a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5483,6 +5483,10 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
> if (sk->sk_protocol != IPPROTO_TCP)
> return -EINVAL;
>
> + if ((optname == TCP_NODELAY || optname == TCP_CORK) &&
> + tcp_sk(sk)->bpf_hdr_opt_len_cb_inprogress)
> + return -EBUSY;
> +
TCP_CORK is not supported in sol_tcp_sockopt(); it already returns -EINVAL by
default. Also, putting the check here would prevent us from calling
getsockopt(TCP_NODELAY) below.
> switch (optname) {
> case TCP_NODELAY:
> case TCP_MAXSEG:
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index dafb63b923d0..fb06c464ac16 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -663,6 +663,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
> RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
>
> newtp->bpf_chg_cc_inprogress = 0;
> + newtp->bpf_hdr_opt_len_cb_inprogress = 0;
> tcp_bpf_clone(sk, newsk);
>
> __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 326b58ff1118..c9654e690e1a 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -475,6 +475,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
> unsigned int *remaining)
> {
> struct bpf_sock_ops_kern sock_ops;
> + struct tcp_sock *tp = tcp_sk(sk);
> int err;
>
> if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> @@ -519,7 +520,9 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
> if (skb)
> bpf_skops_init_skb(&sock_ops, skb, 0);
>
> + tp->bpf_hdr_opt_len_cb_inprogress = 1;
We check BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG before calling BPF_CGROUP_RUN_PROG_SOCK_OPS_SK.
Could this flag be used for the same purpose, so that we don't need to add an extra field?
if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
!*remaining)
return;
> err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
> + tp->bpf_hdr_opt_len_cb_inprogress = 0;
>
> if (err || sock_ops.remaining_opt_len == *remaining)
> return;
--
Thanks,
KaFai
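
For reference, the guard-bit approach described in the quoted commit message
can be modelled in userspace roughly as follows. This is only a sketch under
stated assumptions: struct fake_sock, set_nodelay(), callback() and
hdr_opt_len() are hypothetical stand-ins for the kernel paths named in the
patch, and the real call chain has more steps.

```c
#include <errno.h>

/* Hypothetical miniature of a socket carrying the in-progress guard bit. */
struct fake_sock {
	unsigned int hdr_opt_len_cb_inprogress:1;
	int depth;	/* number of times the callback ran */
	int last_err;	/* result of the nested setsockopt attempt */
};

static void hdr_opt_len(struct fake_sock *sk);

/*
 * Stand-in for bpf_setsockopt(TCP_NODELAY): refuses to run while the
 * header-option-length callback is active, otherwise re-enters the
 * option-length path (modelling push_pending_frames -> current_mss).
 */
static int set_nodelay(struct fake_sock *sk)
{
	if (sk->hdr_opt_len_cb_inprogress)
		return -EBUSY;
	hdr_opt_len(sk);
	return 0;
}

/* Stand-in for the BPF program: always tries to set TCP_NODELAY. */
static void callback(struct fake_sock *sk)
{
	sk->depth++;
	sk->last_err = set_nodelay(sk);
}

/* Stand-in for bpf_skops_hdr_opt_len(): brackets the callback with the guard. */
static void hdr_opt_len(struct fake_sock *sk)
{
	sk->hdr_opt_len_cb_inprogress = 1;
	callback(sk);
	sk->hdr_opt_len_cb_inprogress = 0;
}
```

In this model the callback runs once per hdr_opt_len() call and the nested
set_nodelay() fails with -EBUSY instead of recursing. Whether an existing
flag could play the role of the guard bit is the open question above; the
sketch does not answer it.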
* Re: [PATCH net v2 3/4] net: mana: Don't overwrite port probe error with add_adev result
From: Simon Horman @ 2026-04-14 15:35 UTC (permalink / raw)
To: ernis
Cc: 'Simon Horman', kys, haiyangz, wei.liu, decui, longli,
andrew+netdev, davem, edumazet, kuba, pabeni, ssengar, dipayanroy,
gargaditya, shirazsaleem, kees, kotaranov, leon, shacharr,
stephen, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260413050843.605789-4-ernis@linux.microsoft.com>
From: 'Simon Horman' <horms@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: mana: Don't overwrite port probe error with add_adev result
This commit ensures that if mana_probe_port() or mana_attach() fails, the
error is not overwritten by the result of add_adev(). By preserving the
error, it triggers mana_remove() cleanup instead of leaving ports partially
initialized.
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9d095a6fb56c7..4d42cbfb3f15b 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -3634,10 +3634,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
> if (!resuming) {
> for (i = 0; i < ac->num_ports; i++) {
> err = mana_probe_port(ac, i, &ac->ports[i]);
> - /* we log the port for which the probe failed and stop
> - * probes for subsequent ports.
> - * Note that we keep running ports, for which the probes
> - * were successful, unless add_adev fails too
> + /* Log the port for which the probe failed, stop probing
> + * subsequent ports, and skip add_adev.
> + * Already-probed ports remain functional.
> */
Are these comments accurate regarding ports remaining functional?
Since this patch preserves the error code, execution falls through to the end
of mana_probe() where it checks if err is set and calls mana_remove().
mana_remove() will iterate over the successfully allocated ports, completely
detach them, unregister their net_devices, and destroy event queues.
It appears no ports remain functional because the device is completely torn
down.
[ ... ]
> @@ -3651,10 +3650,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
> enable_work(&apc->queue_reset_work);
> err = mana_attach(ac->ports[i]);
> rtnl_unlock();
> - /* we log the port for which the attach failed and stop
> - * attach for subsequent ports
> - * Note that we keep running ports, for which the attach
> - * were successful, unless add_adev fails too
> + /* Log the port for which the attach failed, stop
> + * attaching subsequent ports, and skip add_adev.
> + * Already-attached ports remain functional.
> */
> if (err) {
> dev_err(dev, "Attach Failed for port %d\n", i);
> @@ -3663,7 +3661,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
Does propagating this error during PM resume introduce a NULL pointer
dereference?
If PM resume fails here (resuming == true), the preserved error code causes
mana_probe() to call mana_remove(), which tears down the device and sets
gd->gdma_context = NULL.
However, a failed resume callback does not automatically unbind the driver.
When the device is eventually unbound later, mana_remove() is invoked again:
drivers/net/ethernet/microsoft/mana/mana_en.c:mana_remove() {
struct gdma_context *gc = gd->gdma_context;
struct device *dev = gc->dev;
...
}
In the second invocation, gd->gdma_context is NULL, meaning gc is NULL,
which could lead to a kernel panic when dereferencing gc->dev.
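
The double-invocation hazard can be illustrated with a tiny userspace model.
This is a sketch only: struct fake_gd, struct fake_ctx and fake_remove() are
hypothetical, and the early return is just one possible way to make a
teardown path tolerate a second call, not a claim about how the driver
should be fixed.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the device and its context pointer. */
struct fake_ctx {
	int dev_id;
};

struct fake_gd {
	struct fake_ctx *ctx;
	int removals;	/* how many times teardown actually ran */
};

/* Returns 0 if teardown ran, 1 if there was nothing left to tear down. */
static int fake_remove(struct fake_gd *gd)
{
	struct fake_ctx *gc = gd->ctx;

	if (!gc)	/* already torn down by an earlier call */
		return 1;

	(void)gc->dev_id;	/* dereference is safe only after the check */
	gd->removals++;
	gd->ctx = NULL;		/* mirrors the context pointer being cleared */
	return 0;
}
```

Without the NULL check, the second call would dereference a NULL context,
which is the scenario described above.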
* Re: [PATCH 2/4] tools: ynl-gen-c: optionally emit structs and helpers
From: Jakub Kicinski @ 2026-04-14 15:35 UTC (permalink / raw)
To: Christoph Böhmwalder
Cc: Jens Axboe, drbd-dev, linux-kernel, Lars Ellenberg,
Philipp Reisner, linux-block, Donald Hunter, Eric Dumazet, netdev
In-Reply-To: <ad4ox7ibZoiW-tje@localhost.localdomain>
On Tue, 14 Apr 2026 14:08:58 +0200 Christoph Böhmwalder wrote:
> But we still need to support the current family via a compat path, and
> I would much rather have two YNL-based families than one genl_magic and
> one YNL-based. Carrying both sounds like a nightmare.
>
> So the spec proposed in this series would never actually be used to
> generate a userspace client, if that's what you're asking. We would
> continue to use the current libgenl-based approach, with some userspace
> compat shims to make it work with YNL. Then, when "drbd2" comes along,
> we could "do things properly".
Let's jump to the drbd2 work.
* [PATCH v2] net: wwan: t7xx: validate port_count against message length in t7xx_port_enum_msg_handler
From: Pavitra Jha @ 2026-04-14 15:31 UTC (permalink / raw)
To: pabeni; +Cc: w, chandrashekar.devegowda, linux-wwan, netdev, stable,
Pavitra Jha
In-Reply-To: <ad4-bTbjtxbUXDU9@1wt.eu>
t7xx_port_enum_msg_handler() uses the modem-supplied port_count field as
a loop bound over port_msg->data[] without checking that the message buffer
contains sufficient data. A modem sending port_count=65535 in a 12-byte
buffer triggers a slab-out-of-bounds read of up to 262140 bytes.
Add a struct_size() check after extracting port_count and before the loop.
Pass msg_len to t7xx_port_enum_msg_handler() and use it to validate
the message size before accessing port_msg->data[].
Pass msg_len from both call sites: skb->len at the DPMAIF path after
skb_pull(), and the captured rt_feature->data_len at the handshake path.
Fixes: 39d439047f1d ("net: wwan: t7xx: Add control DMA interface")
Cc: stable@vger.kernel.org
Reported-by: Pavitra Jha <jhapavitra98@gmail.com>
Signed-off-by: Pavitra Jha <jhapavitra98@gmail.com>
---
drivers/net/wwan/t7xx/t7xx_modem_ops.c | 14 +++++++-------
drivers/net/wwan/t7xx/t7xx_port_ctrl_msg.c | 12 +++++++++---
drivers/net/wwan/t7xx/t7xx_port_proxy.h | 2 +-
3 files changed, 17 insertions(+), 11 deletions(-)
diff --git a/drivers/net/wwan/t7xx/t7xx_modem_ops.c b/drivers/net/wwan/t7xx/t7xx_modem_ops.c
index 7968e208d..d0559fe16 100644
--- a/drivers/net/wwan/t7xx/t7xx_modem_ops.c
+++ b/drivers/net/wwan/t7xx/t7xx_modem_ops.c
@@ -453,25 +453,25 @@ static int t7xx_parse_host_rt_data(struct t7xx_fsm_ctl *ctl, struct t7xx_sys_inf
{
enum mtk_feature_support_type ft_spt_st, ft_spt_cfg;
struct mtk_runtime_feature *rt_feature;
+ size_t feat_data_len;
int i, offset;
offset = sizeof(struct feature_query);
for (i = 0; i < FEATURE_COUNT && offset < data_length; i++) {
rt_feature = data + offset;
- offset += sizeof(*rt_feature) + le32_to_cpu(rt_feature->data_len);
-
+ feat_data_len = le32_to_cpu(rt_feature->data_len);
+ offset += sizeof(*rt_feature) + feat_data_len;
ft_spt_cfg = FIELD_GET(FEATURE_MSK, core->feature_set[i]);
if (ft_spt_cfg != MTK_FEATURE_MUST_BE_SUPPORTED)
continue;
-
ft_spt_st = FIELD_GET(FEATURE_MSK, rt_feature->support_info);
if (ft_spt_st != MTK_FEATURE_MUST_BE_SUPPORTED)
return -EINVAL;
-
- if (i == RT_ID_MD_PORT_ENUM || i == RT_ID_AP_PORT_ENUM)
- t7xx_port_enum_msg_handler(ctl->md, rt_feature->data);
+ if (i == RT_ID_MD_PORT_ENUM || i == RT_ID_AP_PORT_ENUM) {
+ t7xx_port_enum_msg_handler(ctl->md, rt_feature->data,
+ feat_data_len);
+ }
}
-
return 0;
}
diff --git a/drivers/net/wwan/t7xx/t7xx_port_ctrl_msg.c b/drivers/net/wwan/t7xx/t7xx_port_ctrl_msg.c
index ae632ef96..d984a688d 100644
--- a/drivers/net/wwan/t7xx/t7xx_port_ctrl_msg.c
+++ b/drivers/net/wwan/t7xx/t7xx_port_ctrl_msg.c
@@ -124,7 +124,7 @@ static int fsm_ee_message_handler(struct t7xx_port *port, struct t7xx_fsm_ctl *c
* * 0 - Success.
* * -EFAULT - Message check failure.
*/
-int t7xx_port_enum_msg_handler(struct t7xx_modem *md, void *msg)
+int t7xx_port_enum_msg_handler(struct t7xx_modem *md, void *msg, size_t msg_len)
{
struct device *dev = &md->t7xx_dev->pdev->dev;
unsigned int version, port_count, i;
@@ -141,6 +141,13 @@ int t7xx_port_enum_msg_handler(struct t7xx_modem *md, void *msg)
}
port_count = FIELD_GET(PORT_MSG_PRT_CNT, le32_to_cpu(port_msg->info));
+
+ if (msg_len < struct_size(port_msg, data, port_count)) {
+ dev_err(dev, "Port enum msg too short: need %zu, have %zu\n",
+ struct_size(port_msg, data, port_count), msg_len);
+ return -EINVAL;
+ }
+
for (i = 0; i < port_count; i++) {
u32 port_info = le32_to_cpu(port_msg->data[i]);
unsigned int ch_id;
@@ -154,7 +161,6 @@ int t7xx_port_enum_msg_handler(struct t7xx_modem *md, void *msg)
return 0;
}
-
static int control_msg_handler(struct t7xx_port *port, struct sk_buff *skb)
{
const struct t7xx_port_conf *port_conf = port->port_conf;
@@ -191,7 +197,7 @@ static int control_msg_handler(struct t7xx_port *port, struct sk_buff *skb)
case CTL_ID_PORT_ENUM:
skb_pull(skb, sizeof(*ctrl_msg_h));
- ret = t7xx_port_enum_msg_handler(ctl->md, (struct port_msg *)skb->data);
+ ret = t7xx_port_enum_msg_handler(ctl->md, (struct port_msg *)skb->data, skb->len);
if (!ret)
ret = port_ctl_send_msg_to_md(port, CTL_ID_PORT_ENUM, 0);
else
diff --git a/drivers/net/wwan/t7xx/t7xx_port_proxy.h b/drivers/net/wwan/t7xx/t7xx_port_proxy.h
index f0918b36e..7c3190bf0 100644
--- a/drivers/net/wwan/t7xx/t7xx_port_proxy.h
+++ b/drivers/net/wwan/t7xx/t7xx_port_proxy.h
@@ -103,7 +103,7 @@ void t7xx_port_proxy_reset(struct port_proxy *port_prox);
void t7xx_port_proxy_uninit(struct port_proxy *port_prox);
int t7xx_port_proxy_init(struct t7xx_modem *md);
void t7xx_port_proxy_md_status_notify(struct port_proxy *port_prox, unsigned int state);
-int t7xx_port_enum_msg_handler(struct t7xx_modem *md, void *msg);
+int t7xx_port_enum_msg_handler(struct t7xx_modem *md, void *msg, size_t msg_len);
int t7xx_port_proxy_chl_enable_disable(struct port_proxy *port_prox, unsigned int ch_id,
bool en_flag);
void t7xx_port_proxy_set_cfg(struct t7xx_modem *md, enum port_cfg_id cfg_id);
--
2.53.0
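
As a quick sanity check of the numbers in the commit message, the length
validation can be reproduced in userspace. This is a sketch under stated
assumptions: struct fake_port_msg is a hypothetical layout (a 12-byte header
followed by 32-bit entries), and port_msg_size() is a hand-written analogue
of the kernel's struct_size() that saturates to SIZE_MAX on overflow.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 12-byte header followed by 32-bit port entries. */
struct fake_port_msg {
	uint32_t head_pattern;
	uint32_t info;
	uint32_t tail_pattern;
	uint32_t data[];
};

/* Bytes needed for a message with n entries; SIZE_MAX on overflow. */
static size_t port_msg_size(size_t n)
{
	size_t hdr = offsetof(struct fake_port_msg, data);

	if (n > (SIZE_MAX - hdr) / sizeof(uint32_t))
		return SIZE_MAX;
	return hdr + n * sizeof(uint32_t);
}

/* Reject a message whose buffer cannot hold port_count entries. */
static int check_port_msg(size_t msg_len, size_t port_count)
{
	if (msg_len < port_msg_size(port_count))
		return -EINVAL;
	return 0;
}
```

With port_count = 65535 the entries alone take 65535 * 4 = 262140 bytes, so
a 12-byte buffer is rejected, while a 16-byte buffer is enough for a single
entry.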
* Re: [PATCH v3 net] openvswitch: limit vport upcall portids to the number of CPUs
From: Ilya Maximets @ 2026-04-14 15:31 UTC (permalink / raw)
To: Weiming Shi, Aaron Conole, Eelco Chaudron, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: i.maximets, Simon Horman, Thomas Graf, Pravin B Shelar, Alex Wang,
netdev, dev, linux-kernel, Xiang Mei
In-Reply-To: <20260413035514.2113886-3-bestswngs@gmail.com>
On 4/13/26 5:55 AM, Weiming Shi wrote:
> The vport netlink reply helpers allocate a fixed-size skb with
> nlmsg_new(NLMSG_DEFAULT_SIZE, ...) but serialize the full upcall PID
> array via ovs_vport_get_upcall_portids(). Since
> ovs_vport_set_upcall_portids() accepts any non-zero multiple of
> sizeof(u32) with no upper bound, a CAP_NET_ADMIN user can install a PID
> array large enough to overflow the reply buffer, causing nla_put() to
> fail with -EMSGSIZE and hitting BUG_ON(err < 0). On systems with
> unprivileged user namespaces enabled (e.g., Ubuntu default), this is
> reachable via unshare -Urn since OVS vport mutation operations use
> GENL_UNS_ADMIN_PERM.
>
> kernel BUG at net/openvswitch/datapath.c:2414!
> Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
> CPU: 1 UID: 0 PID: 65 Comm: poc Not tainted 7.0.0-rc7-00195-geb216e422044 #1
> RIP: 0010:ovs_vport_cmd_set+0x34c/0x400
> Call Trace:
> <TASK>
> genl_family_rcv_msg_doit (net/netlink/genetlink.c:1116)
> genl_rcv_msg (net/netlink/genetlink.c:1194)
> netlink_rcv_skb (net/netlink/af_netlink.c:2550)
> genl_rcv (net/netlink/genetlink.c:1219)
> netlink_unicast (net/netlink/af_netlink.c:1344)
> netlink_sendmsg (net/netlink/af_netlink.c:1894)
> __sys_sendto (net/socket.c:2206)
> __x64_sys_sendto (net/socket.c:2209)
> do_syscall_64 (arch/x86/entry/syscall_64.c:63)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
> </TASK>
> Kernel panic - not syncing: Fatal exception
>
> Reject attempts to set more PIDs than num_possible_cpus() in
Any reason not to use nr_cpu_ids? If not, then it's better to switch to
that to be consistent with the per-cpu dispatch configuration.
> ovs_vport_set_upcall_portids(), and pre-compute the worst-case reply
> size in ovs_vport_cmd_msg_size() based on that bound, similar to the
> existing ovs_dp_cmd_msg_size().
>
> Fixes: 5cd667b0a456 ("openvswitch: Allow each vport to have an array of 'port_id's.")
> Reported-by: Xiang Mei <xmei5@asu.edu>
> Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> ---
> v3:
> - Cap PID array at num_possible_cpus() in ovs_vport_set_upcall_portids().
> - Add ovs_vport_cmd_msg_size() for worst-case reply allocation.
> - Keep BUG_ON()s, fix Fixes tag.
> v2:
> - Dynamically size reply skb instead of using fixed NLMSG_DEFAULT_SIZE.
> - Drop WARN_ON_ONCE; use plain error returns instead.
>
> net/openvswitch/datapath.c | 23 +++++++++++++++++++++--
> net/openvswitch/vport.c | 3 +++
> 2 files changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index e209099218b4..4049bfa1c4df 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -2184,9 +2184,28 @@ static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb,
> return err;
> }
>
> +static size_t ovs_vport_cmd_msg_size(void)
> +{
> + size_t msgsize = NLMSG_ALIGN(sizeof(struct ovs_header));
> +
> + msgsize += nla_total_size(sizeof(u32)); /* OVS_VPORT_ATTR_PORT_NO */
> + msgsize += nla_total_size(sizeof(u32)); /* OVS_VPORT_ATTR_TYPE */
> + msgsize += nla_total_size(IFNAMSIZ);
> + msgsize += nla_total_size(sizeof(u32)); /* OVS_VPORT_ATTR_IFINDEX */
> + msgsize += nla_total_size(sizeof(s32)); /* OVS_VPORT_ATTR_NETNSID */
> + msgsize += nla_total_size_64bit(sizeof(struct ovs_vport_stats));
> + msgsize += nla_total_size(nla_total_size_64bit(sizeof(u64)) +
> + nla_total_size_64bit(sizeof(u64)));
> + msgsize += nla_total_size(num_possible_cpus() * sizeof(u32));
> + msgsize += nla_total_size(nla_total_size(sizeof(u16)) +
> + nla_total_size(nla_total_size(0)));
Please, add comments about which attributes are included for each line where
it is not obvious. Plain u16 or u64, for example, are not obvious. Put them
on separate lines when they do not fit. E.g.:
/* OVS_VPORT_ATTR_OPTIONS(OVS_TUNNEL_ATTR_DST_PORT +
* OVS_TUNNEL_ATTR_EXTENSION(OVS_VXLAN_EXT_GBP))
*/
msgsize += nla_total_size(nla_total_size(sizeof(u16)) +
nla_total_size(nla_total_size(0)));
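To make the nested sizing easier to follow, here is a minimal userspace sketch of the arithmetic. It assumes the usual netlink layout (4-byte attribute header, 4-byte alignment); the demo_* names are hypothetical stand-ins, not kernel functions, and the 64-bit padded variants are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed netlink constants: 4-byte attribute header, 4-byte alignment. */
#define DEMO_NLA_HDRLEN 4u

/* Round len up to the next multiple of 4. */
static size_t demo_nla_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Model of nla_total_size(): header plus payload, padded to alignment. */
static size_t demo_nla_total_size(size_t payload)
{
	return demo_nla_align(DEMO_NLA_HDRLEN + payload);
}

/* Upcall PID attribute: one u32 per PID, so it grows linearly with n. */
static size_t demo_upcall_pids_attr_size(size_t n)
{
	return demo_nla_total_size(n * sizeof(uint32_t));
}

/*
 * OVS_VPORT_ATTR_OPTIONS(OVS_TUNNEL_ATTR_DST_PORT +
 *                        OVS_TUNNEL_ATTR_EXTENSION(OVS_VXLAN_EXT_GBP))
 */
static size_t demo_options_attr_size(void)
{
	return demo_nla_total_size(demo_nla_total_size(sizeof(uint16_t)) +
				   demo_nla_total_size(demo_nla_total_size(0)));
}
```

Under these assumptions the options attribute has a fixed size, while the PID attribute is the only term that scales with the CPU-based bound.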
Best regards, Ilya Maximets.
* Re: [net,PATCH v3 1/2] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Jakub Kicinski @ 2026-04-14 15:29 UTC
To: Sebastian Andrzej Siewior
Cc: Marek Vasut, netdev, stable, David S. Miller, Andrew Lunn,
Eric Dumazet, Nicolai Buchwitz, Paolo Abeni, Ronald Wahl,
Yicong Hui, linux-kernel
In-Reply-To: <20260414080931.3aef9df4@kernel.org>
On Tue, 14 Apr 2026 08:09:31 -0700 Jakub Kicinski wrote:
> On Tue, 14 Apr 2026 14:57:53 +0200 Sebastian Andrzej Siewior wrote:
> > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> Maybe I'm not being forceful enough.
>
> Putting workarounds in the drivers is unacceptable.
> __netdev_alloc_skb() must be legal to call under an _irq spin lock.
My bad, only read your reply to the old thread now.
* Re: Re: [PATCH,net-next] tcp: Add TCP ROCCET congestion control module.
From: Neal Cardwell @ 2026-04-14 15:26 UTC
To: Lukas Prause
Cc: Tim Fuechsel, David S. Miller, David Ahern, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Kuniyuki Iwashima,
linux-kernel, netdev
In-Reply-To: <1bce4c38-3bc3-4d59-bf36-6aa47c7a5c77@ikt.uni-hannover.de>
On Tue, Apr 14, 2026 at 7:23 AM Lukas Prause
<lukas.prause@ikt.uni-hannover.de> wrote:
>
> Thanks for the very detailed review of our code.
> We will incorporate your comments regarding documentation and variable
> usage into a new version of our code.
Sounds good. Thank you.
> > Please reference figures in the paper and mention specific concrete
> > numerical examples of latency reductions to quantify these statements.
>
> Figures 5 and 6 show the performance of ROCCET in stationary and mobile
> scenarios (https://arxiv.org/pdf/2510.25281). In the analyzed scenario,
> we have observed a lower sRTT with ROCCET than with BBRv3 and CUBIC. The
> observed throughput was marginally lower than that of BBRv3, but still
> on a similar level. A detailed quantitative evaluation can be found in
> the paper in sections VI and VII.
In https://arxiv.org/pdf/2510.25281 zooming into the Figure 6 sRTT
box-and-whisker-plot seems to show that BBRv3 actually has a lower
median sRTT value than ROCCET. So that statement seems misleading?
I would recommend using numerical examples in the commit message to
quantify the gains from ROCCET and avoid potential issues from visual
interpretation of graphs.
> > Can you please elaborate on this statement here? AFAICT from figures 7
> > and 8 in https://arxiv.org/pdf/2510.25281 it seems ROCCET is
> > essentially starved by CUBIC when sharing a bottleneck with CUBIC when
> > the bottleneck has 2*BDP or more of buffering. AFAICT it sounds like
> > ROCCET does have "fairness issues when sharing a link with TCP CUBIC"?
>
> Our main use case is a connection where the bottleneck link is in the
> cellular network, where the bottleneck queue is typically not shared
> between flows. "Fairness" between flows is being implemented by the base
> station's scheduler. In this scenario, ROCCET achieves its objective to
> not "bloat" its own queue.
>
> We have performed additional fairness experiments in non-cellular
> networks (figures 7 and 8). Here we show that even when used in other
> types of networks, ROCCET does not cause harm (see
> https://dl.acm.org/doi/10.1145/3365609.3365855) to other congestion control.
I do not see you objecting to my statement, "it seems ROCCET is
essentially starved by CUBIC when sharing a bottleneck with CUBIC when
the bottleneck has 2*BDP or more of buffering." So I guess you agree.
IMHO it's important to keep in mind that a congestion control that
starves in the presence of CUBIC may have limited deployment. This is
a key reason why Vegas was never deployed at scale.
> > Please specify what side effect or side effects ROCCET is claiming to
> > solve (presumably bufferbloat?).
> The side effect we observe in cellular networks is that, in particular,
> for loss-based congestion control, the cwnd often gets 'frozen' at a
> size that is too large for the BDP of the current link. This effect is
> caused by the TCP cwnd validation, which at some point stops increasing
> the cwnd because it assumes that the sender is application-limited.
> However, this often leads to a cwnd size that is too large for the link,
> but too small to cause a congestion event by overfilling the buffer. The
> result is a standing queue that causes permanently high RTTs. Figure 2
> in the paper (https://arxiv.org/pdf/2510.25281) shows the described
> behaviour for a single TCP CUBIC flow.
OK, so that sounds like you are describing the standard bufferbloat
problem. So you could replace the phrase "solves an unwanted side
effects of CUBIC’s implementation" in your comment with something
like: "avoids the bufferbloat problems inherent in CUBIC."
> > Expressed in isolation like this, that sounds potentially dangerous.
> > Please mention what signal(s) ROCCET uses to exit slow start if it's
> > not using loss.
> >
> > In addition, from reading the code AFAICT the connection does use loss
> > to exit slow start (see my remarks below in this message). So AFAICT
> > this summary seems inaccurate, or at least misleading?
> You are right, the summary is misleading. In the code we submitted,
> there are three conditions for exiting slow start:
> The first one is packet loss (as you already mentioned, without a cwnd
> reduction). The second is if the srRTT calculated by ROCCET exceeds an upper
> bound and ACK rate, sampled in 100ms time intervals, differs by 10
> segments. The third one is when the growth of the cwnd is stopped by the
> TCP cwnd validation (which considers the connection as
> application-limited).
OK, thanks for clarifying.
> > If no lower RTT is found for 10 seconds, the algorithm interpolates
> > the `min_rtt` upwards towards the current RTT.
> >
> > + If the path is persistently congested (e.g., a large buffer is
> > constantly full), the `min_rtt` baseline will drift up.
> >
> > + This makes the algorithm less sensitive to queueing delay over
> > time, potentially defeating the purpose of reducing bufferbloat in the
> > long run. Contrast this with BBR, which actively drains the queue
> > (using the ProbeRTT mechanism) to try to find the true physical
> > minimum RTT.
> >
> > Can you please add a comment explaining why the ROCCET algorithm takes
> > this approach, and how the algorithm expects to avoid queues that
> > ratchet ever higher?
> We added this functionality for the edge case of long-lived fat flows,
> which are experiencing routing changes, to detect a higher base RTT.
> Since this functionality is disabled by default and can also cause
> problems with min_RTT detection, we have decided to remove it.
> The measurement results in our paper have been obtained with this
> functionality disabled.
Again, thanks for clarifying.
> > Here, `cnt` is incremented by `1` on every call, regardless of the
> > `acked` value (number of packets ACKed in this event).
> You are right, we will change this.
Great. Thanks.
> > + With the default `ack_rate_diff_ca` of `200`, this condition will
> > become true for $sum_cwnd * 100 / sum_acked >= 200$, i.e.
> > $num_acks_per_round * 100 >= 200$. So AFAICT we expect this condition
> > to be true if there are 2 or more ACKs in a round trip. This makes
> > `bw_limit_detect` effectively a no-op or always-on trigger rather than
> > a true detector of queue growth or bandwidth limits.
> The purpose of this part of the code was to detect an increasing queue
> by monitoring data sent and acknowledged in combination with an
> increasing sRTT over 5 RTT time intervals. In the steady state of a TCP
> connection, the sending rate of the TCP sender should be equal to the
> receiver's ack rate, due to TCP self-clocking. The idea behind this code
> was to check if the cwnd is still correlated to the sending rate. If
> this is not the case and we also observe increasing RTTs, we assume the
> TCP sender is filling a buffer. However, we have made a mistake when
> calculating sum_cwnd:
> We are accumulating the cwnd on each ack event, instead of each RTT,
> which, as you mentioned, would make more sense. Because this leads to
> the erroneous behaviour that you described, we will remove this part of
> the code for now until we have evaluated the intended implementation.
Sounds good. Thanks.
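The effect discussed above can be illustrated with a small userspace sketch. It is a simplified model, not the submitted code: the function names are hypothetical, and it assumes one round trip in which cwnd packets are acknowledged by num_acks equally sized ACK events (cwnd divisible by num_acks, num_acks > 0).

```c
/*
 * Hypothetical model: cwnd is added to sum_cwnd on every ACK event.
 * Returns sum_cwnd * 100 / sum_acked for one round trip.
 */
static unsigned long demo_ratio_per_ack(unsigned long cwnd,
					unsigned long num_acks)
{
	unsigned long sum_cwnd = 0, sum_acked = 0, i;

	for (i = 0; i < num_acks; i++) {
		sum_cwnd += cwnd;             /* accumulated per ACK event */
		sum_acked += cwnd / num_acks; /* packets covered by this ACK */
	}
	return sum_cwnd * 100 / sum_acked;
}

/* Same round trip, but cwnd is sampled only once per RTT. */
static unsigned long demo_ratio_per_rtt(unsigned long cwnd,
					unsigned long num_acks)
{
	unsigned long sum_acked = 0, i;

	for (i = 0; i < num_acks; i++)
		sum_acked += cwnd / num_acks;

	return cwnd * 100 / sum_acked;
}
```

In this model the per-ACK ratio equals num_acks * 100, so it reaches a threshold of 200 as soon as a round trip contains two ACKs, whereas the per-RTT ratio stays at 100 while the sending rate and the ACK rate match.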
> > Did the experiments in the paper use the approach documented in the
> > paper, or the approach documented in this code? They are very
> > different, AFAICT.
> The experiments were performed using the submitted code. This means that
> the mentioned code snippet always evaluates to true, so that ROCCET only
> reacts to changes in latency, which is different from what we described
> in the paper.
Got it. Thanks.
> > Having a module parameter to ignore loss in this way makes it too easy
> > for users to cause excessive congestion. I would urge you to remove
> > that module parameter. Researchers can add that sort of mechanism in
> > their own code for research.
> That is true, we will remove this part of the implementation.
Sounds good.
Thanks!
neal
* [PATCH bpf v4 5/5] bpf, sockmap: Take state lock for af_unix iter
From: Michal Luczaj @ 2026-04-14 14:13 UTC
To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>
When a BPF iterator program updates a sockmap, there is a race condition in
unix_stream_bpf_update_proto() where the `peer` pointer can become stale[1]
during a state transition TCP_ESTABLISHED -> TCP_CLOSE.
CPU0 bpf                            CPU1 close
--------                            ----------

// unix_stream_bpf_update_proto()
sk_pair = unix_peer(sk)
if (unlikely(!sk_pair))
	return -EINVAL;
                                    // unix_release_sock()
                                    skpair = unix_peer(sk);
                                    unix_peer(sk) = NULL;
                                    sock_put(skpair)
sock_hold(sk_pair) // UaF
More practically, this fix guarantees that the iterator program is
consistently provided with a unix socket that remains stable during
iterator execution.
[1]:
BUG: KASAN: slab-use-after-free in unix_stream_bpf_update_proto+0x155/0x490
Write of size 4 at addr ffff8881178c9a00 by task test_progs/2231
Call Trace:
dump_stack_lvl+0x5d/0x80
print_report+0x170/0x4f3
kasan_report+0xe4/0x1c0
kasan_check_range+0x125/0x200
unix_stream_bpf_update_proto+0x155/0x490
sock_map_link+0x71c/0xec0
sock_map_update_common+0xbc/0x600
sock_map_update_elem+0x19a/0x1f0
bpf_prog_bbbf56096cdd4f01_selective_dump_unix+0x20c/0x217
bpf_iter_run_prog+0x21e/0xae0
bpf_iter_unix_seq_show+0x1e0/0x2a0
bpf_seq_read+0x42c/0x10d0
vfs_read+0x171/0xb20
ksys_read+0xff/0x200
do_syscall_64+0xf7/0x5e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Allocated by task 2236:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
__kasan_slab_alloc+0x63/0x80
kmem_cache_alloc_noprof+0x1d5/0x680
sk_prot_alloc+0x59/0x210
sk_alloc+0x34/0x470
unix_create1+0x86/0x8a0
unix_stream_connect+0x318/0x15b0
__sys_connect+0xfd/0x130
__x64_sys_connect+0x72/0xd0
do_syscall_64+0xf7/0x5e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 2236:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x70
__kasan_slab_free+0x47/0x70
kmem_cache_free+0x11c/0x590
__sk_destruct+0x432/0x6e0
unix_release_sock+0x9b3/0xf60
unix_release+0x8a/0xf0
__sock_release+0xb0/0x270
sock_close+0x18/0x20
__fput+0x36e/0xac0
fput_close_sync+0xe5/0x1a0
__x64_sys_close+0x7d/0xd0
do_syscall_64+0xf7/0x5e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/unix/af_unix.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 590a30d3b2f7..15b48cc6e9b0 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -3737,6 +3737,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
return 0;
lock_sock(sk);
+ unix_state_lock(sk);
if (unlikely(sock_flag(sk, SOCK_DEAD))) {
ret = SEQ_SKIP;
@@ -3748,6 +3749,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
prog = bpf_iter_get_info(&meta, false);
ret = unix_prog_seq_show(prog, &meta, v, uid);
unlock:
+ unix_state_unlock(sk);
release_sock(sk);
return ret;
}
--
2.53.0
* [PATCH bpf v4 4/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update
From: Michal Luczaj @ 2026-04-14 14:13 UTC
To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
钱一铭
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>
unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
socket is properly set up, which would include having a defined peer. IOW,
there's a window when unix_stream_bpf_update_proto() can be called on
socket which still has unix_peer(sk) == NULL.
CPU0 bpf                            CPU1 connect
--------                            ------------

                                    WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
                                    sock_hold(newsk)
                                    smp_mb__after_atomic()
                                    unix_peer(sk) = newsk
BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
sock_map_link+0x564/0x8b0
sock_map_update_common+0x6e/0x340
sock_map_update_elem_sys+0x17d/0x240
__sys_bpf+0x26db/0x3250
__x64_sys_bpf+0x21/0x30
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Initial idea was to move peer assignment _before_ the sk_state update[1],
but that involved an additional memory barrier, and changing the hot path
was rejected.
Then a NULL check during proto update in unix_stream_bpf_update_proto() was
considered[2], but the follow-up discussion[3] focused on the root cause,
i.e. sockmap update taking a wrong lock. Or, more specifically, missing
unix_state_lock()[4].
In the end it was concluded that teaching sockmap about the af_unix locking
would be unnecessarily complex[5].
Complexity aside, since BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT
are allowed to update sockmaps, sock_map_update_elem() taking the unix
lock, as it is currently implemented in unix_state_lock():
spin_lock(&unix_sk(s)->lock), would be problematic. unix_state_lock() taken
in a process context, followed by a softirq-context TC BPF program
attempting to take the same spinlock -- deadlock[6].
This way we circled back to the peer check idea[2].
[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
[4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
[5]: https://lore.kernel.org/bpf/CAAVpQUAHijOMext28Gi10dSLuMzGYh+jK61Ujn+fZ-wvcODR2A@mail.gmail.com/
[6]: https://lore.kernel.org/bpf/dd043c69-4d03-46fe-8325-8f97101435cf@linux.dev/
Summary of scenarios where af_unix/stream connect() may race a sockmap
update:
1. connect() vs. bpf(BPF_MAP_UPDATE_ELEM), i.e. sock_map_update_elem_sys()
Implemented NULL check is sufficient. Once assigned, socket peer won't
be released until socket fd is released. And that's not an issue because
sock_map_update_elem_sys() bumps fd refcnt.
2. connect() vs BPF program doing update
Update restricted per verifier.c:may_update_sockmap() to
BPF_PROG_TYPE_TRACING/BPF_TRACE_ITER
BPF_PROG_TYPE_SOCK_OPS (bpf_sock_map_update() only)
BPF_PROG_TYPE_SOCKET_FILTER
BPF_PROG_TYPE_SCHED_CLS
BPF_PROG_TYPE_SCHED_ACT
BPF_PROG_TYPE_XDP
BPF_PROG_TYPE_SK_REUSEPORT
BPF_PROG_TYPE_FLOW_DISSECTOR
BPF_PROG_TYPE_SK_LOOKUP
Plus one more race to consider:
CPU0 bpf                            CPU1 connect
--------                            ------------

                                    WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
                                    sock_hold(newsk)
                                    smp_mb__after_atomic()
                                    unix_peer(sk) = newsk
sk_pair = unix_peer(sk)
if (unlikely(!sk_pair))
	return -EINVAL;

                                    CPU1 close
                                    ----------

                                    skpair = unix_peer(sk);
                                    unix_peer(sk) = NULL;
                                    sock_put(skpair)
// use after free?
sock_hold(sk_pair)
2.1 BPF program invoking helper function bpf_sock_map_update() ->
BPF_CALL_4(bpf_sock_map_update(), ...)
Helper limited to BPF_PROG_TYPE_SOCK_OPS. Nevertheless, a unix sock
might be accessible via bpf_map_lookup_elem(). Which implies sk
already having psock, which in turn implies sk already having
sk_pair. Since sk_psock_destroy() is queued as RCU work, sk_pair
won't go away while BPF executes the update.
2.2 BPF program invoking helper function bpf_map_update_elem() ->
sock_map_update_elem()
2.2.1 Unix sock accessible to BPF prog only via sockmap lookup in
BPF_PROG_TYPE_SOCKET_FILTER, BPF_PROG_TYPE_SCHED_CLS,
BPF_PROG_TYPE_SCHED_ACT, BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_SK_REUSEPORT, BPF_PROG_TYPE_FLOW_DISSECTOR,
BPF_PROG_TYPE_SK_LOOKUP.
Pretty much the same as case 2.1.
2.2.2 Unix sock accessible to BPF program directly:
BPF_PROG_TYPE_TRACING, narrowed down to BPF_TRACE_ITER.
Sockmap iterator (sock_map_seq_ops) is safe: unix sock
residing in a sockmap means that the sock already went through
the proto update step.
Unix sock iterator (bpf_iter_unix_seq_ops), on the other hand,
gives access to socks that may still be unconnected. Which
means iterator prog can race sockmap/proto update against
connect().
BUG: KASAN: null-ptr-deref in unix_stream_bpf_update_proto+0x253/0x4d0
Write of size 4 at addr 0000000000000080 by task test_progs/3140
Call Trace:
dump_stack_lvl+0x5d/0x80
kasan_report+0xe4/0x1c0
kasan_check_range+0x125/0x200
unix_stream_bpf_update_proto+0x253/0x4d0
sock_map_link+0x71c/0xec0
sock_map_update_common+0xbc/0x600
sock_map_update_elem+0x19a/0x1f0
bpf_prog_bbbf56096cdd4f01_selective_dump_unix+0x20c/0x217
bpf_iter_run_prog+0x21e/0xae0
bpf_iter_unix_seq_show+0x1e0/0x2a0
bpf_seq_read+0x42c/0x10d0
vfs_read+0x171/0xb20
ksys_read+0xff/0x200
do_syscall_64+0xf7/0x5e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
While the introduced NULL check prevents null-ptr-deref in the
BPF program path as well, it is insufficient to guard against
a poorly timed close() leading to a use-after-free. This will
be addressed in a subsequent patch.
Reported-by: Michal Luczaj <mhal@rbox.co>
Closes: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
Reported-by: 钱一铭 <yimingqian591@gmail.com>
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/unix/unix_bpf.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index e0d30d6d22ac..57f3124c9d8d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -185,6 +185,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
*/
if (!psock->sk_pair) {
sk_pair = unix_peer(sk);
+ if (unlikely(!sk_pair))
+ return -EINVAL;
+
sock_hold(sk_pair);
psock->sk_pair = sk_pair;
}
--
2.53.0
* Re: [PATCH v2] wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
From: Jakub Kicinski @ 2026-04-14 15:18 UTC
To: Jason A. Donenfeld
Cc: Shardul Bankar, kuniyu, andrew+netdev, davem, edumazet, pabeni,
wireguard, netdev, linux-kernel, janak, kalpan.jani, shardulsb08,
syzbot+f2fbf7478a35a94c8b7c
In-Reply-To: <CAHmME9oXoXykXq_emkA3v8nG2VR28CmRP2+WmhrvGJc0ZbPfpA@mail.gmail.com>
On Tue, 14 Apr 2026 15:28:37 +0200 Jason A. Donenfeld wrote:
> Thanks. Applied to the wireguard tree, and also added the missing
> __net_exit and __read_mostly annotations in the process.
Hi Jason, while we have you - do you have a PR for us for wireguard?
We're going to be sending the net-next PR later today..
* [PATCH net-next 2/2] net: mana: Use kvmalloc for large RX queue and buffer allocations
From: Aditya Garg @ 2026-04-14 15:13 UTC
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
gargaditya
In-Reply-To: <20260414151456.687506-1-gargaditya@linux.microsoft.com>
The RX path allocations for rxbufs_pre, das_pre, and rxq scale with
queue count and queue depth. With high queue counts and depth, these can
exceed what kmalloc can reliably provide from physically contiguous
memory under fragmentation.
Switch these from kmalloc to kvmalloc variants so the allocator
transparently falls back to vmalloc when contiguous memory is scarce,
and update the corresponding frees to kvfree.
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49ee77b0939a..585d891bbbac 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -685,11 +685,11 @@ void mana_pre_dealloc_rxbufs(struct mana_port_context *mpc)
put_page(virt_to_head_page(mpc->rxbufs_pre[i]));
}
- kfree(mpc->das_pre);
+ kvfree(mpc->das_pre);
mpc->das_pre = NULL;
out2:
- kfree(mpc->rxbufs_pre);
+ kvfree(mpc->rxbufs_pre);
mpc->rxbufs_pre = NULL;
out1:
@@ -806,11 +806,11 @@ int mana_pre_alloc_rxbufs(struct mana_port_context *mpc, int new_mtu, int num_qu
num_rxb = num_queues * mpc->rx_queue_size;
WARN(mpc->rxbufs_pre, "mana rxbufs_pre exists\n");
- mpc->rxbufs_pre = kmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
+ mpc->rxbufs_pre = kvmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
if (!mpc->rxbufs_pre)
goto error;
- mpc->das_pre = kmalloc_objs(dma_addr_t, num_rxb);
+ mpc->das_pre = kvmalloc_objs(dma_addr_t, num_rxb);
if (!mpc->das_pre)
goto error;
@@ -2527,7 +2527,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
if (rxq->gdma_rq)
mana_gd_destroy_queue(gc, rxq->gdma_rq);
- kfree(rxq);
+ kvfree(rxq);
}
static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
@@ -2667,7 +2667,7 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
gc = gd->gdma_context;
- rxq = kzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
+ rxq = kvzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
if (!rxq)
return NULL;
--
2.43.0
* [PATCH net-next 0/2] net: mana: Avoid queue struct allocation failure under memory fragmentation
From: Aditya Garg @ 2026-04-14 15:13 UTC
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
gargaditya
The MANA driver can fail to load on systems with high memory
utilization because several allocations in the queue setup paths
require large physically contiguous blocks via kmalloc. Under memory
fragmentation these high-order allocations may fail, preventing the
driver from creating queues at probe time or when reconfiguring
channels, ring parameters or MTU at runtime.
Allocation sizes that are problematic:
  mana_create_txq -> tx_qp flat array (sizeof(mana_tx_qp) = 35528):
    16 queues (default): 35528 * 16 =  ~555 KB contiguous
    64 queues (max):     35528 * 64 = ~2220 KB contiguous

  mana_create_rxq -> rxq struct with flex array
  (sizeof(mana_rxq) = 35712, rx_oobs=296 per entry):
    depth 1024 (default): 35712 + 296 * 1024 =  ~331 KB per queue
    depth 8192 (max):     35712 + 296 * 8192 = ~2403 KB per queue

  mana_pre_alloc_rxbufs -> rxbufs_pre and das_pre arrays:
    16 queues, depth 1024 (default): 16 * 1024 * 8 =  128 KB each
    64 queues, depth 8192 (max):     64 * 8192 * 8 = 4096 KB each
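
The figures above can be reproduced with a small userspace sketch. The struct sizes are taken as given from this cover letter; the macro and function names are hypothetical, and the 8-byte entry size assumes a 64-bit build (pointer and dma_addr_t both 8 bytes).

```c
#include <stddef.h>

/* Sizes quoted in the cover letter, treated here as constants. */
#define DEMO_TX_QP_SIZE  35528u /* sizeof(struct mana_tx_qp) */
#define DEMO_RXQ_SIZE    35712u /* sizeof(struct mana_rxq) without rx_oobs[] */
#define DEMO_RX_OOB_SIZE 296u   /* one rx_oobs[] entry */
#define DEMO_ENTRY_SIZE  8u     /* assumed: void * / dma_addr_t on 64-bit */

/* Flat tx_qp array: one contiguous block covering all queues. */
static size_t demo_tx_qp_flat_bytes(size_t num_queues)
{
	return (size_t)DEMO_TX_QP_SIZE * num_queues;
}

/* rxq struct plus its flexible rx_oobs[] array, per queue. */
static size_t demo_rxq_bytes(size_t depth)
{
	return DEMO_RXQ_SIZE + (size_t)DEMO_RX_OOB_SIZE * depth;
}

/* rxbufs_pre or das_pre: one entry per pre-allocated RX buffer. */
static size_t demo_pre_array_bytes(size_t num_queues, size_t depth)
{
	return num_queues * depth * DEMO_ENTRY_SIZE;
}
```

Dividing the returned byte counts by 1024 gives the approximate KB values listed above.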
This series addresses the issue by:
1. Converting the tx_qp flat array into an array of pointers with
per-queue kvzalloc (~35 KB each), replacing a single contiguous
allocation that can reach ~2.2 MB at 64 queues.
2. Switching rxbufs_pre, das_pre, and rxq allocations to
kvmalloc/kvzalloc so the allocator can fall back to vmalloc
when contiguous memory is unavailable.
Throughput testing confirms no regression. Since kvmalloc falls
back to vmalloc under memory fragmentation, all kvmalloc calls
were temporarily replaced with vmalloc to simulate the fallback
path (iperf3, GBits/sec):
               Physically contiguous     vmalloc region
Connections    TX        RX              TX        RX
--------------------------------------------------------------
1              47.2      46.9            46.8      46.6
16             181       181             181       181
32             181       181             181       181
64             181       181             181       181
Aditya Garg (2):
net: mana: Use per-queue allocation for tx_qp to reduce allocation
size
net: mana: Use kvmalloc for large RX queue and buffer allocations
.../net/ethernet/microsoft/mana/mana_bpf.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++--------
.../ethernet/microsoft/mana/mana_ethtool.c | 2 +-
include/net/mana/mana.h | 2 +-
4 files changed, 39 insertions(+), 28 deletions(-)
--
2.43.0
* [PATCH net-next 1/2] net: mana: Use per-queue allocation for tx_qp to reduce allocation size
From: Aditya Garg @ 2026-04-14 15:13 UTC
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
gargaditya
In-Reply-To: <20260414151456.687506-1-gargaditya@linux.microsoft.com>
Convert tx_qp from a single contiguous array allocation to per-queue
individual allocations. Each mana_tx_qp struct is approximately 35KB.
With many queues (e.g., 32/64), the flat array requires a single
contiguous allocation that can fail under memory fragmentation.
Change mana_tx_qp *tx_qp to mana_tx_qp **tx_qp (array of pointers),
allocating each queue's mana_tx_qp individually via kvzalloc. This
reduces each allocation to ~35KB and provides vmalloc fallback,
avoiding allocation failure due to fragmentation.
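
The allocation pattern described above can be sketched in userspace as follows. This is an illustration only, not the driver code: calloc() and free() stand in for the kernel allocators, and struct demo_tx_qp and the demo_* functions are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct mana_tx_qp (about 35 KB in the driver). */
struct demo_tx_qp {
	char payload[35528];
};

/* Free each allocated queue, then the pointer array; NULL slots are skipped. */
static void demo_destroy_txq(struct demo_tx_qp **tx_qp, size_t num_queues)
{
	size_t i;

	if (!tx_qp)
		return;

	for (i = 0; i < num_queues; i++) {
		if (!tx_qp[i])
			continue;
		free(tx_qp[i]);
	}
	free(tx_qp);
}

/* Allocate a small zeroed pointer array, then one object per queue. */
static struct demo_tx_qp **demo_create_txq(size_t num_queues)
{
	struct demo_tx_qp **tx_qp;
	size_t i;

	tx_qp = calloc(num_queues, sizeof(*tx_qp));
	if (!tx_qp)
		return NULL;

	for (i = 0; i < num_queues; i++) {
		tx_qp[i] = calloc(1, sizeof(*tx_qp[i]));
		if (!tx_qp[i]) {
			/* Partial failure: the remaining slots are still NULL. */
			demo_destroy_txq(tx_qp, num_queues);
			return NULL;
		}
	}
	return tx_qp;
}
```

Because the pointer array starts out zeroed, teardown can run safely after a partial failure, which is why the destroy path checks each slot before touching it.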
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
.../net/ethernet/microsoft/mana/mana_bpf.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 49 ++++++++++++-------
.../ethernet/microsoft/mana/mana_ethtool.c | 2 +-
include/net/mana/mana.h | 2 +-
4 files changed, 33 insertions(+), 22 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 7697c9b52ed3..b5e9bb184a1d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -68,7 +68,7 @@ int mana_xdp_xmit(struct net_device *ndev, int n, struct xdp_frame **frames,
count++;
}
- tx_stats = &apc->tx_qp[q_idx].txq.stats;
+ tx_stats = &apc->tx_qp[q_idx]->txq.stats;
u64_stats_update_begin(&tx_stats->syncp);
tx_stats->xdp_xmit += count;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 09a53c977545..49ee77b0939a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -355,9 +355,9 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
if (skb_cow_head(skb, MANA_HEADROOM))
goto tx_drop_count;
- txq = &apc->tx_qp[txq_idx].txq;
+ txq = &apc->tx_qp[txq_idx]->txq;
gdma_sq = txq->gdma_sq;
- cq = &apc->tx_qp[txq_idx].tx_cq;
+ cq = &apc->tx_qp[txq_idx]->tx_cq;
tx_stats = &txq->stats;
BUILD_BUG_ON(MAX_TX_WQE_SGL_ENTRIES != MANA_MAX_TX_WQE_SGL_ENTRIES);
@@ -614,7 +614,7 @@ static void mana_get_stats64(struct net_device *ndev,
}
for (q = 0; q < num_queues; q++) {
- tx_stats = &apc->tx_qp[q].txq.stats;
+ tx_stats = &apc->tx_qp[q]->txq.stats;
do {
start = u64_stats_fetch_begin(&tx_stats->syncp);
@@ -2284,21 +2284,26 @@ static void mana_destroy_txq(struct mana_port_context *apc)
return;
for (i = 0; i < apc->num_queues; i++) {
- debugfs_remove_recursive(apc->tx_qp[i].mana_tx_debugfs);
- apc->tx_qp[i].mana_tx_debugfs = NULL;
+ if (!apc->tx_qp[i])
+ continue;
+
+ debugfs_remove_recursive(apc->tx_qp[i]->mana_tx_debugfs);
+ apc->tx_qp[i]->mana_tx_debugfs = NULL;
- napi = &apc->tx_qp[i].tx_cq.napi;
- if (apc->tx_qp[i].txq.napi_initialized) {
+ napi = &apc->tx_qp[i]->tx_cq.napi;
+ if (apc->tx_qp[i]->txq.napi_initialized) {
napi_synchronize(napi);
napi_disable_locked(napi);
netif_napi_del_locked(napi);
- apc->tx_qp[i].txq.napi_initialized = false;
+ apc->tx_qp[i]->txq.napi_initialized = false;
}
- mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
+ mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
- mana_deinit_cq(apc, &apc->tx_qp[i].tx_cq);
+ mana_deinit_cq(apc, &apc->tx_qp[i]->tx_cq);
- mana_deinit_txq(apc, &apc->tx_qp[i].txq);
+ mana_deinit_txq(apc, &apc->tx_qp[i]->txq);
+
+ kvfree(apc->tx_qp[i]);
}
kfree(apc->tx_qp);
@@ -2307,7 +2312,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
static void mana_create_txq_debugfs(struct mana_port_context *apc, int idx)
{
- struct mana_tx_qp *tx_qp = &apc->tx_qp[idx];
+ struct mana_tx_qp *tx_qp = apc->tx_qp[idx];
char qnum[32];
sprintf(qnum, "TX-%d", idx);
@@ -2346,7 +2351,7 @@ static int mana_create_txq(struct mana_port_context *apc,
int err;
int i;
- apc->tx_qp = kzalloc_objs(struct mana_tx_qp, apc->num_queues);
+ apc->tx_qp = kzalloc_objs(struct mana_tx_qp *, apc->num_queues);
if (!apc->tx_qp)
return -ENOMEM;
@@ -2366,10 +2371,16 @@ static int mana_create_txq(struct mana_port_context *apc,
gc = gd->gdma_context;
for (i = 0; i < apc->num_queues; i++) {
- apc->tx_qp[i].tx_object = INVALID_MANA_HANDLE;
+ apc->tx_qp[i] = kvzalloc_obj(*apc->tx_qp[i]);
+ if (!apc->tx_qp[i]) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ apc->tx_qp[i]->tx_object = INVALID_MANA_HANDLE;
/* Create SQ */
- txq = &apc->tx_qp[i].txq;
+ txq = &apc->tx_qp[i]->txq;
u64_stats_init(&txq->stats.syncp);
txq->ndev = net;
@@ -2387,7 +2398,7 @@ static int mana_create_txq(struct mana_port_context *apc,
goto out;
/* Create SQ's CQ */
- cq = &apc->tx_qp[i].tx_cq;
+ cq = &apc->tx_qp[i]->tx_cq;
cq->type = MANA_CQ_TYPE_TX;
cq->txq = txq;
@@ -2416,7 +2427,7 @@ static int mana_create_txq(struct mana_port_context *apc,
err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
&wq_spec, &cq_spec,
- &apc->tx_qp[i].tx_object);
+ &apc->tx_qp[i]->tx_object);
if (err)
goto out;
@@ -3242,7 +3253,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
*/
for (i = 0; i < apc->num_queues; i++) {
- txq = &apc->tx_qp[i].txq;
+ txq = &apc->tx_qp[i]->txq;
tsleep = 1000;
while (atomic_read(&txq->pending_sends) > 0 &&
time_before(jiffies, timeout)) {
@@ -3261,7 +3272,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
}
for (i = 0; i < apc->num_queues; i++) {
- txq = &apc->tx_qp[i].txq;
+ txq = &apc->tx_qp[i]->txq;
while ((skb = skb_dequeue(&txq->pending_skbs))) {
mana_unmap_skb(skb, apc);
dev_kfree_skb_any(skb);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index f2d220b371b5..f5901e4c9816 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -251,7 +251,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
}
for (q = 0; q < num_queues; q++) {
- tx_stats = &apc->tx_qp[q].txq.stats;
+ tx_stats = &apc->tx_qp[q]->txq.stats;
do {
start = u64_stats_fetch_begin(&tx_stats->syncp);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..60b4a4146ea2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -505,7 +505,7 @@ struct mana_port_context {
bool tx_shortform_allowed;
u16 tx_vp_offset;
- struct mana_tx_qp *tx_qp;
+ struct mana_tx_qp **tx_qp;
/* Indirection Table for RX & TX. The values are queue indexes */
u32 *indir_table;
--
2.43.0
^ permalink raw reply related
* [PATCH net] net: pse-pd: fix out-of-bounds bitmap access in pse_isr() on 32-bit
From: Kory Maincent @ 2026-04-14 15:13 UTC (permalink / raw)
To: Kory Maincent (Dent Project), Jakub Kicinski, netdev,
linux-kernel
Cc: Carlo Szelinsky, thomas.petazzoni, Oleksij Rempel, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni
In pse_isr(), notifs_mask was declared as a single unsigned long on the
stack (32 bits on 32-bit architectures). For PSE controllers with more
than 32 ports, this causes two problems:
- map_event callbacks could write bit positions >= 32 via
*notifs_mask |= BIT(i), which is undefined behaviour on a 32-bit
unsigned long and corrupts adjacent stack memory.
- for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) treats
&notifs_mask as a multi-word bitmap and reads beyond the single
unsigned long when nr_lines > BITS_PER_LONG.
Fix this by moving notifs_mask out of the stack and into struct pse_irq
as a dynamically allocated bitmap. It is sized with
BITS_TO_LONGS(pcdev->nr_lines) words in devm_pse_irq_helper(), so it
is always wide enough regardless of the host word size.
Fixes: fc0e6db30941a ("net: pse-pd: Add support for reporting events")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
---
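Note (not part of the commit message): the sizing argument can be
illustrated with a small userspace sketch. The demo_* helpers below are
hypothetical and only mimic what BITS_TO_LONGS(), bitmap_empty() and the
kernel bit helpers do; they are not the kernel implementations. The
48-line controller is an assumed example, not a figure from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITS_TO_LONGS(n)	(((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Words needed for nbits when each word holds bits_per_word bits. */
static size_t demo_words_needed(size_t nbits, size_t bits_per_word)
{
	return (nbits + bits_per_word - 1) / bits_per_word;
}

/* Zeroed bitmap wide enough for nbits on the host word size. */
static unsigned long *demo_bitmap_alloc(size_t nbits)
{
	assert(nbits > 0);
	return calloc(DEMO_BITS_TO_LONGS(nbits), sizeof(unsigned long));
}

/* Index the correct word first, then shift within that word only. */
static void demo_set_bit(unsigned long *map, size_t bit)
{
	map[bit / DEMO_BITS_PER_LONG] |= 1UL << (bit % DEMO_BITS_PER_LONG);
}

static bool demo_test_bit(const unsigned long *map, size_t bit)
{
	return (map[bit / DEMO_BITS_PER_LONG] >> (bit % DEMO_BITS_PER_LONG)) & 1UL;
}

/* True when no bit in [0, nbits) is set. */
static bool demo_bitmap_empty(const unsigned long *map, size_t nbits)
{
	size_t i;

	for (i = 0; i < nbits; i++)
		if (demo_test_bit(map, i))
			return false;

	return true;
}
```

With 32-bit words, demo_words_needed(48, 32) is 2, so a single
unsigned long cannot hold the mask for such a controller; with 64-bit
words one word happens to be enough, which is why the problem only
shows up on 32-bit.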
drivers/net/pse-pd/pse_core.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 3beaaaeec9e1f..2ced837f375d2 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -1170,6 +1170,7 @@ struct pse_irq {
struct pse_controller_dev *pcdev;
struct pse_irq_desc desc;
unsigned long *notifs;
+ unsigned long *notifs_mask;
};
/**
@@ -1247,7 +1248,6 @@ static int pse_set_config_isr(struct pse_controller_dev *pcdev, int id,
static irqreturn_t pse_isr(int irq, void *data)
{
struct pse_controller_dev *pcdev;
- unsigned long notifs_mask = 0;
struct pse_irq_desc *desc;
struct pse_irq *h = data;
int ret, i;
@@ -1257,14 +1257,15 @@ static irqreturn_t pse_isr(int irq, void *data)
/* Clear notifs mask */
memset(h->notifs, 0, pcdev->nr_lines * sizeof(*h->notifs));
+ bitmap_zero(h->notifs_mask, pcdev->nr_lines);
mutex_lock(&pcdev->lock);
- ret = desc->map_event(irq, pcdev, h->notifs, &notifs_mask);
- if (ret || !notifs_mask) {
+ ret = desc->map_event(irq, pcdev, h->notifs, h->notifs_mask);
+ if (ret || bitmap_empty(h->notifs_mask, pcdev->nr_lines)) {
mutex_unlock(&pcdev->lock);
return IRQ_NONE;
}
- for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) {
+ for_each_set_bit(i, h->notifs_mask, pcdev->nr_lines) {
unsigned long notifs, rnotifs;
struct pse_ntf ntf = {};
@@ -1340,6 +1341,11 @@ int devm_pse_irq_helper(struct pse_controller_dev *pcdev, int irq,
if (!h->notifs)
return -ENOMEM;
+ h->notifs_mask = devm_kcalloc(dev, BITS_TO_LONGS(pcdev->nr_lines),
+ sizeof(*h->notifs_mask), GFP_KERNEL);
+ if (!h->notifs_mask)
+ return -ENOMEM;
+
ret = devm_request_threaded_irq(dev, irq, NULL, pse_isr,
IRQF_ONESHOT | irq_flags,
irq_name, h);
--
2.43.0
^ permalink raw reply related
* Re: [PATCH iwl-next v2 2/2] idpf: implement pci error handlers
From: Lukas Wunner @ 2026-04-14 15:13 UTC (permalink / raw)
To: Emil Tantilov
Cc: intel-wired-lan, netdev, przemyslaw.kitszel, jay.bhat,
ivan.d.barrera, aleksandr.loktionov, larysa.zaremba,
anthony.l.nguyen, andrew+netdev, davem, edumazet, kuba, pabeni,
aleksander.lobakin, linux-pci, madhu.chittim, decot, willemb,
sheenamo
In-Reply-To: <20260414031631.2107-3-emil.s.tantilov@intel.com>
On Mon, Apr 13, 2026 at 08:16:31PM -0700, Emil Tantilov wrote:
> +static pci_ers_result_t
> +idpf_pci_err_slot_reset(struct pci_dev *pdev)
> +{
> + struct idpf_adapter *adapter = pci_get_drvdata(pdev);
> +
> + pci_restore_state(pdev);
> + pci_set_master(pdev);
> + pci_wake_from_d3(pdev, false);
> + if (readl(adapter->reset_reg.rstat) != 0xFFFFFFFF)
> + return PCI_ERS_RESULT_RECOVERED;
FWIW, there's a PCI_POSSIBLE_ERROR() helper that you may find useful
to check for an "all ones" MMIO read.
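
For illustration only, the "all ones" check can be sketched in userspace
as below. The DEMO_* macros are hypothetical local stand-ins written to
resemble the kernel helper; the real definition lives in
include/linux/pci.h and should be consulted rather than this sketch.
__typeof__ is a GCC/Clang extension.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's "all ones" error response value. */
#define DEMO_ERROR_RESPONSE	(~0ULL)

/* True when val is all ones at its own width (8, 16, 32 or 64 bits). */
#define DEMO_POSSIBLE_ERROR(val) \
	((val) == ((__typeof__(val))DEMO_ERROR_RESPONSE))

/* Hypothetical helper: a register read of all ones means no response. */
static bool demo_device_responding(uint32_t rstat)
{
	return !DEMO_POSSIBLE_ERROR(rstat);
}
```

The cast to the operand's own type is what lets one macro replace
width-specific constants such as 0xFFFFFFFF.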
Thanks,
Lukas
^ permalink raw reply
* [PATCH bpf v4 2/5] bpf, sockmap: Fix af_unix iter deadlock
From: Michal Luczaj @ 2026-04-14 14:13 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
Jiayuan Chen
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>
bpf_iter_unix_seq_show() may deadlock when lock_sock_fast() takes the fast
path and the iter prog attempts to update a sockmap, which ends up spinning
at sock_map_update_elem()'s bh_lock_sock():
WARNING: possible recursive locking detected
test_progs/1393 is trying to acquire lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0
but task is already holding lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(slock-AF_UNIX);
lock(slock-AF_UNIX);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by test_progs/1393:
#0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
#1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
#2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
#3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00
Call Trace:
dump_stack_lvl+0x5d/0x80
print_deadlock_bug.cold+0xc0/0xce
__lock_acquire+0x130f/0x2590
lock_acquire+0x14e/0x2b0
_raw_spin_lock+0x30/0x40
sock_map_update_elem+0xdb/0x1f0
bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
bpf_iter_run_prog+0x5b9/0xb00
bpf_iter_unix_seq_show+0x1f7/0x2e0
bpf_seq_read+0x42c/0x10d0
vfs_read+0x171/0xb20
ksys_read+0xff/0x200
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/unix/af_unix.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index b23c33df8b46..590a30d3b2f7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -3731,15 +3731,14 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
struct bpf_prog *prog;
struct sock *sk = v;
uid_t uid;
- bool slow;
int ret;
if (v == SEQ_START_TOKEN)
return 0;
- slow = lock_sock_fast(sk);
+ lock_sock(sk);
- if (unlikely(sk_unhashed(sk))) {
+ if (unlikely(sock_flag(sk, SOCK_DEAD))) {
ret = SEQ_SKIP;
goto unlock;
}
@@ -3749,7 +3748,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
prog = bpf_iter_get_info(&meta, false);
ret = unix_prog_seq_show(prog, &meta, v, uid);
unlock:
- unlock_sock_fast(sk, slow);
+ release_sock(sk);
return ret;
}
--
2.53.0
^ permalink raw reply related
* Re: [RFC] Proposal: Add sysfs interface for PCIe TPH Steering Tag retrieval and configuration
From: Jason Gunthorpe @ 2026-04-14 15:11 UTC (permalink / raw)
To: fengchengwen
Cc: Leon Romanovsky, Bjorn Helgaas, linux-rdma, linux-pci, netdev,
dri-devel, Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang
In-Reply-To: <11eaea26-ec10-264a-db1e-951f6b46078d@huawei.com>
On Tue, Apr 14, 2026 at 10:46:00PM +0800, fengchengwen wrote:
> We have a real platform requirement:
>
> * 1. Devices in TPH Device-Specific Mode with no standard ST table
> * 2. Steering Tags must be obtained from ACPI _DSM (kernel-only)
> * 3. Devices are fully managed by userspace drivers (VFIO/UIO)
> * 4. Userspace must program STs into vendor-specific registers
No, this is nonsensical too.
If you want to control the steering tags for MMIO BAR memory exposed
by VFIO then the DMABUF mechanism Keith & co has been working on is
the correct approach.
If the VFIO user needs to control steering tags for the device it is
directly controlling then it must do that through VFIO ioctls.
Nobody messes around with other devices under the covers of the
operating kernel driver. Stop proposing that.
Jason
^ permalink raw reply