* Re: [PATCH bpf-next v2 02/15] bpf: Make struct_ops tasks_rcu grace period optional
From: Eduard Zingerman @ 2026-06-26 22:20 UTC
To: Amery Hung, bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
shakeel.butt, roman.gushchin, kuniyu, kerneljasonxing,
kernel-team
In-Reply-To: <20260623175006.3136053-3-ameryhung@gmail.com>
On Tue, 2026-06-23 at 10:49 -0700, Amery Hung wrote:
> From: Martin KaFai Lau <martin.lau@kernel.org>
>
> bpf_struct_ops_map_free() currently waits for both a regular RCU grace
> period and a tasks RCU grace period for every struct_ops map through
> synchronize_rcu_mult(call_rcu, call_rcu_tasks).
>
> A regular RCU grace period is still required for all struct_ops maps
> because the struct_ops trampoline ksyms requires a rcu grace period
> (take a look at the list_del_rcu in __bpf_ksym_del).
> Add a map_free_pre_rcu() callback so the struct_ops map can remove
> ksyms before bpf_map_put() wait for the regular rcu grace period.
>
> The tasks RCU grace period is only needed by tcp_congestion_ops.
> Add free_after_tasks_rcu_gp only to struct bpf_struct_ops instead
> of the bpf_map.
>
> When CONFIG_TASKS_RCU=n, synchronize_rcu_tasks() is the same as
> synchronize_rcu(). Since all struct_ops maps now complete a regular RCU
> grace period before bpf_struct_ops_map_free() runs, skip the extra
> synchronize_rcu_tasks() call in this case.
>
> This cleanup prepares for a later patch that needs to support
> free_after_mult_rcu_gp.
>
> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Amery Hung <ameryhung@gmail.com>
> ---
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
[...]
> @@ -997,24 +1006,8 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>
> bpf_struct_ops_map_dissoc_progs(st_map);
>
> - bpf_struct_ops_map_del_ksyms(st_map);
> -
> - /* The struct_ops's function may switch to another struct_ops.
> - *
> - * For example, bpf_tcp_cc_x->init() may switch to
> - * another tcp_cc_y by calling
> - * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> - * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
> - * and its refcount may reach 0 which then free its
> - * trampoline image while tcp_cc_x is still running.
> - *
> - * A vanilla rcu gp is to wait for all bpf-tcp-cc prog
> - * to finish. bpf-tcp-cc prog is non sleepable.
> - * A rcu_tasks gp is to wait for the last few insn
> - * in the tramopline image to finish before releasing
> - * the trampoline image.
> - */
> - synchronize_rcu_mult(call_rcu, call_rcu_tasks);
> + if (tasks_rcu && IS_ENABLED(CONFIG_TASKS_RCU))
> + synchronize_rcu_tasks();
As far as I understand, this removes the synchronize_rcu_tasks() call
for the qdisc, sched_ext, smc and hid struct_ops. As far as I can tell,
each of them uses its own mechanism to guarantee that no pending BPF
trampoline still refers to the image being freed here.
So, the change appears to be safe.
>
> __bpf_struct_ops_map_free(map);
> }
[...]
* [PATCH] netfilter: x_tables: replace strlcat() with snprintf()
From: Ian Bridges @ 2026-06-26 22:25 UTC
To: Pablo Neira Ayuso, Florian Westphal, Phil Sutter, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
netfilter-devel, coreteam, netdev, linux-kernel
Cc: linux-hardening
In preparation for removing the deprecated strlcat() API[1], replace the
strscpy()/strlcat() pairs in xt_proto_init() and xt_proto_fini() with
snprintf(), which builds each /proc file name in a single call.
Each name is "<prefix><suffix>", where <prefix> is the address-family
string xt_prefix[af] and <suffix> is one of the FORMAT_TABLES,
FORMAT_MATCHES or FORMAT_TARGETS literals. snprintf() with a "%s%s"
format produces the same NUL-terminated, length-bounded string as the
strscpy()/strlcat() chain it replaces, so the proc entry names are
unchanged.
Link: https://github.com/KSPP/linux/issues/370 [1]
Signed-off-by: Ian Bridges <icb@fastmail.org>
---
net/netfilter/x_tables.c | 24 ++++++++----------------
1 file changed, 8 insertions(+), 16 deletions(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 4e6708c23922..56f4546be336 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -2033,8 +2033,7 @@ int xt_proto_init(struct net *net, u_int8_t af)
root_uid = make_kuid(net->user_ns, 0);
root_gid = make_kgid(net->user_ns, 0);
- strscpy(buf, xt_prefix[af], sizeof(buf));
- strlcat(buf, FORMAT_TABLES, sizeof(buf));
+ snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TABLES);
proc = proc_create_net_data(buf, 0440, net->proc_net, &xt_table_seq_ops,
sizeof(struct seq_net_private),
(void *)(unsigned long)af);
@@ -2043,8 +2042,7 @@ int xt_proto_init(struct net *net, u_int8_t af)
if (uid_valid(root_uid) && gid_valid(root_gid))
proc_set_user(proc, root_uid, root_gid);
- strscpy(buf, xt_prefix[af], sizeof(buf));
- strlcat(buf, FORMAT_MATCHES, sizeof(buf));
+ snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_MATCHES);
proc = proc_create_seq_private(buf, 0440, net->proc_net,
&xt_match_seq_ops, sizeof(struct nf_mttg_trav),
(void *)(unsigned long)af);
@@ -2053,8 +2051,7 @@ int xt_proto_init(struct net *net, u_int8_t af)
if (uid_valid(root_uid) && gid_valid(root_gid))
proc_set_user(proc, root_uid, root_gid);
- strscpy(buf, xt_prefix[af], sizeof(buf));
- strlcat(buf, FORMAT_TARGETS, sizeof(buf));
+ snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TARGETS);
proc = proc_create_seq_private(buf, 0440, net->proc_net,
&xt_target_seq_ops, sizeof(struct nf_mttg_trav),
(void *)(unsigned long)af);
@@ -2068,13 +2065,11 @@ int xt_proto_init(struct net *net, u_int8_t af)
#ifdef CONFIG_PROC_FS
out_remove_matches:
- strscpy(buf, xt_prefix[af], sizeof(buf));
- strlcat(buf, FORMAT_MATCHES, sizeof(buf));
+ snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_MATCHES);
remove_proc_entry(buf, net->proc_net);
out_remove_tables:
- strscpy(buf, xt_prefix[af], sizeof(buf));
- strlcat(buf, FORMAT_TABLES, sizeof(buf));
+ snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TABLES);
remove_proc_entry(buf, net->proc_net);
out:
return -1;
@@ -2087,16 +2082,13 @@ void xt_proto_fini(struct net *net, u_int8_t af)
#ifdef CONFIG_PROC_FS
char buf[XT_FUNCTION_MAXNAMELEN];
- strscpy(buf, xt_prefix[af], sizeof(buf));
- strlcat(buf, FORMAT_TABLES, sizeof(buf));
+ snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TABLES);
remove_proc_entry(buf, net->proc_net);
- strscpy(buf, xt_prefix[af], sizeof(buf));
- strlcat(buf, FORMAT_TARGETS, sizeof(buf));
+ snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TARGETS);
remove_proc_entry(buf, net->proc_net);
- strscpy(buf, xt_prefix[af], sizeof(buf));
- strlcat(buf, FORMAT_MATCHES, sizeof(buf));
+ snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_MATCHES);
remove_proc_entry(buf, net->proc_net);
#endif /*CONFIG_PROC_FS*/
}
--
2.47.3
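The equivalence claimed in the commit message above can be checked in userspace with a rough sketch like the one below. It is not kernel code: name_old() only emulates the bounded, NUL-terminated behaviour of the strscpy()/strlcat() pair with strncpy()/strncat(), name_new() mirrors the snprintf() call from the patch, and both function names are made up for illustration. The value 30 for XT_FUNCTION_MAXNAMELEN and the "ip6" / "_tables_matches" strings are assumptions about what xt_prefix[] and the FORMAT_* literals contain.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define XT_FUNCTION_MAXNAMELEN 30 /* assumed value, for illustration only */

/* Hypothetical stand-in for the old strscpy() + strlcat() pair. */
void name_old(char *buf, size_t size, const char *prefix, const char *suffix)
{
	size_t used;

	assert(size > 0);
	/* like strscpy(): copy at most size - 1 bytes and always terminate */
	strncpy(buf, prefix, size - 1);
	buf[size - 1] = '\0';
	/* like strlcat(): append only into the room that is left */
	used = strlen(buf);
	strncat(buf, suffix, size - 1 - used);
}

/* Same shape as the snprintf() call introduced by the patch. */
void name_new(char *buf, size_t size, const char *prefix, const char *suffix)
{
	snprintf(buf, size, "%s%s", prefix, suffix);
}
```

Both helpers yield the same string when the name fits and when it has to be truncated, which is the property the patch relies on.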
* Re: [PATCH iwl-next v5 2/2] ice: implement symmetric RSS hash configuration
From: Jakub Kicinski @ 2026-06-26 22:26 UTC
To: Aleksandr Loktionov; +Cc: intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260626054730.1126969-3-aleksandr.loktionov@intel.com>
On Fri, 26 Jun 2026 07:47:30 +0200 Aleksandr Loktionov wrote:
> - /* Update the VSI's hash function */
> - if (rxfh->input_xfrm & RXH_XFRM_SYM_XOR)
> - hfunc = ICE_AQ_VSI_Q_OPT_RSS_HASH_SYM_TPLZ;
> + /* Handle RSS symmetric hash transformation */
> + if (rxfh->input_xfrm != RXH_XFRM_NO_CHANGE) {
> + u8 new_hfunc;
I think this is the very bad part. Please extract it and send it as
a fix to net. It looks like any change to the RSS config on ice randomly
enables xfrm sym. I isolated it to the ntuple.py test, which only changes
the indir table, and the driver says:
ice 0000:e1:00.0 ens1f0np0: Hash function set to: Symmetric Toeplitz
Which we never asked for. I drafted this before seeing your reply:
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -3692,10 +3692,10 @@ ice_set_rxfh(struct net_device *netdev, struct ethtool_rxfh_param *rxfh,
struct netlink_ext_ack *extack)
{
struct ice_netdev_priv *np = netdev_priv(netdev);
- u8 hfunc = ICE_AQ_VSI_Q_OPT_RSS_HASH_TPLZ;
struct ice_vsi *vsi = np->vsi;
struct ice_pf *pf = vsi->back;
struct device *dev;
+ u8 hfunc;
int err;
dev = ice_pf_to_dev(pf);
@@ -3714,9 +3714,12 @@ ice_set_rxfh(struct net_device *netdev, struct ethtool_rxfh_param *rxfh,
return -EOPNOTSUPP;
}
- /* Update the VSI's hash function */
- if (rxfh->input_xfrm & RXH_XFRM_SYM_XOR)
+ if (rxfh->input_xfrm == RXH_XFRM_NO_CHANGE)
+ hfunc = vsi->rss_hfunc;
+ else if (rxfh->input_xfrm & RXH_XFRM_SYM_XOR)
hfunc = ICE_AQ_VSI_Q_OPT_RSS_HASH_SYM_TPLZ;
+ else /* input_xfrm == 0; core rejects any other value */
+ hfunc = ICE_AQ_VSI_Q_OPT_RSS_HASH_TPLZ;
err = ice_set_rss_hfunc(vsi, hfunc);
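One plausible reading of the symptom, sketched in userspace C under stated assumptions: RXH_XFRM_NO_CHANGE is taken to be 0xff and RXH_XFRM_SYM_XOR to be bit 0 (recalled from the uapi ethtool header, so please verify), which would make the old "& RXH_XFRM_SYM_XOR" test true for a "no change" request. The toy_hfunc enum and the pick_hfunc_old()/pick_hfunc_new() helpers are hypothetical names that only mirror the shape of the draft above; they are not driver code.

```c
#include <stdint.h>

/* Assumed values, recalled from include/uapi/linux/ethtool.h. */
#define RXH_XFRM_SYM_XOR   (1 << 0)
#define RXH_XFRM_NO_CHANGE 0xff

/* Hypothetical stand-ins for the ICE_AQ_VSI_Q_OPT_RSS_HASH_* values. */
enum toy_hfunc { TOY_HASH_TPLZ = 0, TOY_HASH_SYM_TPLZ = 1 };

/* Old shape: default to Toeplitz, switch to symmetric if bit 0 is set. */
enum toy_hfunc pick_hfunc_old(uint8_t input_xfrm)
{
	if (input_xfrm & RXH_XFRM_SYM_XOR)
		return TOY_HASH_SYM_TPLZ;
	return TOY_HASH_TPLZ;
}

/* Drafted shape: "no change" keeps whatever the VSI currently uses. */
enum toy_hfunc pick_hfunc_new(uint8_t input_xfrm, enum toy_hfunc current)
{
	if (input_xfrm == RXH_XFRM_NO_CHANGE)
		return current;
	if (input_xfrm & RXH_XFRM_SYM_XOR)
		return TOY_HASH_SYM_TPLZ;
	return TOY_HASH_TPLZ; /* input_xfrm == 0 */
}
```

With these assumed values, pick_hfunc_old(RXH_XFRM_NO_CHANGE) selects the symmetric variant even though nothing asked for it, while pick_hfunc_new() returns the current setting.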
* Re: [PATCH iwl-next v5 1/2] ethtool: treat RXH_GTP_TEID as intrinsically symmetric
From: Jakub Kicinski @ 2026-06-26 22:29 UTC
To: Aleksandr Loktionov; +Cc: intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260626054730.1126969-2-aleksandr.loktionov@intel.com>
On Fri, 26 Jun 2026 07:47:29 +0200 Aleksandr Loktionov wrote:
> A GTP tunnel uses the same TEID value in both directions of a flow;
> including TEID in the hash input does not break src/dst symmetry.
>
> ethtool_rxfh_config_is_sym() currently rejects any hash field bitmap
> that contains bits outside the four paired L3/L4 fields. This causes
> drivers that hash GTP flows on TEID to fail the kernel's preflight
> validation in ethtool_check_flow_types(), making it impossible for
> those drivers to support symmetric-xor transforms at all.
>
> Strip RXH_GTP_TEID from the bitmap before the paired-field check so
> that drivers may honestly report TEID hashing without blocking the
> configuration of symmetric transforms.
I don't know much about GTP, but "the Internet" does not seem to agree
with your claim:
The TEID uniquely identifies the GSN tunnel endpoints. The tunnels
for an uplink and a downlink are separate and use a different TEID.
https://docs.paloaltonetworks.com/service-providers/10-1/mobile-network-infrastructure-getting-started/gtp/mobile-network-protection-profile
So I don't think this will fly.
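Why a per-direction TEID matters for symmetry can be shown with a toy hash. This is only a sketch: toy_sym_hash() and struct toy_flow are invented for illustration and have nothing to do with the Toeplitz hash or any driver. The hash XORs the paired source/destination fields so that swapping them does not change the result, and optionally mixes in the TEID.

```c
#include <stdint.h>

/* Hypothetical flow key; field names are illustrative only. */
struct toy_flow {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint32_t teid;
};

/*
 * Toy symmetric hash: XOR-ing the paired fields makes the result
 * independent of direction. The TEID is mixed in unpaired.
 */
uint32_t toy_sym_hash(const struct toy_flow *f, int include_teid)
{
	uint32_t h = f->saddr ^ f->daddr;

	h = h * 31u + (uint32_t)(f->sport ^ f->dport);
	if (include_teid)
		h = h * 31u + f->teid;
	return h;
}
```

If uplink and downlink carry different TEIDs, as the quoted documentation says, the two directions hash differently once the TEID is included; they only match when both directions happen to use the same TEID.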
* [PATCH net] net: gianfar: dispose irq mappings on probe failure and device removal
From: Rosen Penev @ 2026-06-26 22:52 UTC
To: netdev
Cc: Claudiu Manoil, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Andy Fleming, open list
irq_of_parse_and_map() creates irqdomain mappings that should be
balanced with irq_dispose_mapping(). The driver never called
irq_dispose_mapping(), leaking mappings on probe failure and
device removal.
Fix by adding irq_dispose_mapping() in free_gfar_dev() and
expanding its loop from priv->num_grps to MAXGROUPS so the
error path also catches partially-initialized groups. All
irqinfo pointers are pre-initialized to NULL in gfar_of_init(),
making the NULL-guarded walk in free_gfar_dev() safe for every
scenario.
gfar_parse_group() itself is left as a simple parse function
with no resource management; cleanup is centralized in the
caller's error path.
Assisted-by: opencode:big-pickle
Fixes: b31a1d8b4151 ("gianfar: Convert gianfar to an of_platform_driver")
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
drivers/net/ethernet/freescale/gianfar.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 3271de5844f8..89215e1ddc2d 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -469,10 +469,13 @@ static void free_gfar_dev(struct gfar_private *priv)
{
int i, j;
- for (i = 0; i < priv->num_grps; i++)
+ for (i = 0; i < MAXGROUPS; i++)
for (j = 0; j < GFAR_NUM_IRQS; j++) {
- kfree(priv->gfargrp[i].irqinfo[j]);
- priv->gfargrp[i].irqinfo[j] = NULL;
+ if (priv->gfargrp[i].irqinfo[j]) {
+ irq_dispose_mapping(priv->gfargrp[i].irqinfo[j]->irq);
+ kfree(priv->gfargrp[i].irqinfo[j]);
+ priv->gfargrp[i].irqinfo[j] = NULL;
+ }
}
free_netdev(priv->ndev);
@@ -616,7 +619,7 @@ static phy_interface_t gfar_get_interface(struct net_device *dev)
static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
{
const char *model;
- int err = 0, i;
+ int err = 0, i, j;
phy_interface_t interface;
struct net_device *dev = NULL;
struct gfar_private *priv = NULL;
@@ -702,8 +705,11 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
priv->rx_list.count = 0;
mutex_init(&priv->rx_queue_access);
- for (i = 0; i < MAXGROUPS; i++)
+ for (i = 0; i < MAXGROUPS; i++) {
priv->gfargrp[i].regs = NULL;
+ for (j = 0; j < GFAR_NUM_IRQS; j++)
+ priv->gfargrp[i].irqinfo[j] = NULL;
+ }
/* Parse and initialize group specific information */
if (priv->mode == MQ_MG_MODE) {
--
2.54.0
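The cleanup pattern described in the commit message above (pre-NULL every slot, let the parser fail part-way, then walk all slots in one place) can be sketched in userspace as follows. All toy_* names are hypothetical, malloc()/free() stand in for the kernel allocator, toy_dispose_mapping() is just a counter standing in for irq_dispose_mapping(), and the MAXGROUPS / GFAR_NUM_IRQS values are illustrative.

```c
#include <stdlib.h>

#define MAXGROUPS     2 /* illustrative values */
#define GFAR_NUM_IRQS 3

struct toy_irqinfo { unsigned int irq; };
struct toy_group { struct toy_irqinfo *irqinfo[GFAR_NUM_IRQS]; };
struct toy_priv {
	struct toy_group gfargrp[MAXGROUPS];
	unsigned int disposed; /* counts stand-in dispose calls */
};

/* Stand-in for irq_dispose_mapping(): only counts invocations. */
void toy_dispose_mapping(struct toy_priv *priv, unsigned int irq)
{
	(void)irq;
	priv->disposed++;
}

/* Pre-initialize every slot to NULL so the free walk is always safe. */
void toy_init(struct toy_priv *priv)
{
	priv->disposed = 0;
	for (int i = 0; i < MAXGROUPS; i++)
		for (int j = 0; j < GFAR_NUM_IRQS; j++)
			priv->gfargrp[i].irqinfo[j] = NULL;
}

/*
 * Map the first nirqs slots of a group. Returns -1 when fewer than
 * GFAR_NUM_IRQS were mapped, simulating a failure part-way through.
 * No cleanup happens here; that is left to toy_free().
 */
int toy_parse_group(struct toy_priv *priv, int grp, int nirqs)
{
	for (int j = 0; j < nirqs && j < GFAR_NUM_IRQS; j++) {
		struct toy_irqinfo *info = malloc(sizeof(*info));

		if (!info)
			return -1;
		info->irq = (unsigned int)(grp * GFAR_NUM_IRQS + j + 1);
		priv->gfargrp[grp].irqinfo[j] = info;
	}
	return nirqs == GFAR_NUM_IRQS ? 0 : -1;
}

/* Centralized cleanup: walk every slot, skip the ones never filled. */
void toy_free(struct toy_priv *priv)
{
	for (int i = 0; i < MAXGROUPS; i++)
		for (int j = 0; j < GFAR_NUM_IRQS; j++) {
			struct toy_irqinfo *info = priv->gfargrp[i].irqinfo[j];

			if (!info)
				continue;
			toy_dispose_mapping(priv, info->irq);
			free(info);
			priv->gfargrp[i].irqinfo[j] = NULL;
		}
}
```

Because every slot is reset to NULL after it is released, calling toy_free() a second time is a no-op, which is what makes a single cleanup path usable for both the error path and normal removal in this sketch.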
* Re: [PATCH net-next v1] tcp/dccp: avoid parity split for socket-local bind range
From: Kuniyuki Iwashima @ 2026-06-26 23:40 UTC
To: xuanqiang.luo
Cc: Eric Dumazet, Neal Cardwell, netdev, David S . Miller,
Jakub Kicinski, Paolo Abeni, Simon Horman, luoxuanqiang
In-Reply-To: <20260626093856.61864-1-xuanqiang.luo@linux.dev>
On Fri, Jun 26, 2026 at 2:40 AM <xuanqiang.luo@linux.dev> wrote:
>
> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
>
> IP_LOCAL_PORT_RANGE lets applications override the netns ephemeral port
> range on a per-socket basis. __inet_hash_connect() already treats such a
> range as an explicit application partition and scans it with step 1 [1].
>
> Do the same in inet_csk_find_open_port():
What's the use case of IP_LOCAL_PORT_RANGE + bind(, 0)
without IP_BIND_ADDRESS_NO_PORT ?
> when a socket-local range is set,
> walk the whole selected range instead of first splitting it by parity.
> Keep the existing step-2 parity behavior for sockets using the netns range,
> so the default bind/connect separation remains unchanged.
>
> [1] https://lore.kernel.org/r/20231214192939.1962891-3-edumazet@google.com
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
> ---
> net/ipv4/inet_connection_sock.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 56902bba54838..ad8af70c92ca3 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -323,13 +323,16 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
> struct inet_bind2_bucket *tb2;
> struct inet_bind_bucket *tb;
> u32 remaining, offset;
> + bool local_ports;
> bool relax = false;
> + int step;
>
> l3mdev = inet_sk_bound_l3mdev(sk);
> ports_exhausted:
> attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
> other_half_scan:
> - inet_sk_get_local_port_range(sk, &low, &high);
> + local_ports = inet_sk_get_local_port_range(sk, &low, &high);
> + step = local_ports ? 1 : 2;
> high++; /* [32768, 60999] -> [32768, 61000[ */
> if (high - low < 4)
> attempt_half = 0;
> @@ -342,18 +345,19 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
> low = half;
> }
> remaining = high - low;
> - if (likely(remaining > 1))
> + if (!local_ports && remaining > 1)
> remaining &= ~1U;
>
> offset = get_random_u32_below(remaining);
> /* __inet_hash_connect() favors ports having @low parity
> * We do the opposite to not pollute connect() users.
> */
> - offset |= 1U;
> + if (!local_ports)
> + offset |= 1U;
>
> other_parity_scan:
> port = low + offset;
> - for (i = 0; i < remaining; i += 2, port += 2) {
> + for (i = 0; i < remaining; i += step, port += step) {
> if (unlikely(port >= high))
> port -= remaining;
> if (inet_is_local_reserved_port(net, port))
> @@ -384,9 +388,11 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
> cond_resched();
> }
>
> - offset--;
> - if (!(offset & 1))
> - goto other_parity_scan;
> + if (!local_ports) {
> + offset--;
> + if (!(offset & 1))
> + goto other_parity_scan;
> + }
>
> if (attempt_half == 1) {
> /* OK we now try the upper half of the range */
> --
> 2.43.0
>
* Re: [PATCH bpf-next v2 00/15] bpf: A common way to attach struct_ops to a cgroup
From: Roman Gushchin @ 2026-06-26 23:59 UTC
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, eddyz87, memxor,
martin.lau, shakeel.butt, kuniyu, kerneljasonxing, kernel-team
In-Reply-To: <20260623175006.3136053-1-ameryhung@gmail.com>
Amery Hung <ameryhung@gmail.com> writes:
> Hi,
>
> I am continuing Martin's work to support attaching struct_ops to
> cgroup.
Awesome, thank you for working on this!
I'm going to rebase the bpf oom work on top of this patchset and will give
it some additional testing.
Thanks!
* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Jakub Kicinski @ 2026-06-27 0:33 UTC
To: Andrew Lunn
Cc: Maxime Chevallier, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <5b7dbdbc-93fd-4664-abad-0f47855fab55@lunn.ch>
On Fri, 26 Jun 2026 14:39:57 +0200 Andrew Lunn wrote:
> On Fri, Jun 26, 2026 at 10:33:50AM +0200, Maxime Chevallier wrote:
> >
> > > Sphinx follows pythons object orientate structure. So you could have a
> > > class test_ethtool_pause_advertising, with class documentation. And
> > > then methods within the class which are individual tests. The
> > > commented out section would then be method documentation.
> >
> > Good point, so maybe something along these lines :
> >
> > - A class for the test
> > - methods for indivitual tests
> > - For readability, I've written what the internal test helper would look
> > like (_adv_test), and how a test would look like without the helper in
> > adv_rx_on_tx_on().
> >
> > I'm already diving into coding, but it helps me a bit in the definition of the
> > "description" format :)
> >
> > this is what the class would look like :
>
> I like this :-)
This is very far from what the existing Python tests in netdev do.
I would prefer to stick to the "bash on steroids" use of Python.
Are you both familiar with the existing tests?
* Re: [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
From: Jason Xing @ 2026-06-27 0:36 UTC
To: Fijalkowski, Maciej, Roman Gushchin
Cc: Zaremba, Larysa, netdev@vger.kernel.org, bpf@vger.kernel.org,
Karlsson, Magnus, stfomichev@gmail.com, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, bjorn@kernel.org, Jason Xing
In-Reply-To: <IA1PR11MB60978AF3799FD895C53A860882EB2@IA1PR11MB6097.namprd11.prod.outlook.com>
On Fri, Jun 26, 2026 at 9:43 PM Fijalkowski, Maciej
<maciej.fijalkowski@intel.com> wrote:
>
> >
> > On Fri, Jun 26, 2026 at 7:12 PM Larysa Zaremba <larysa.zaremba@intel.com>
> > wrote:
> > >
> > > On Tue, Jun 23, 2026 at 03:32:34PM +0200, Maciej Fijalkowski wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > This patch is inspired by the check[1] from sashiko. It says when
> > > > overflow happens, the address of cq to be published is invalid.
> > > > Actually the severer thing is the whole process of publishing the
> > > > address of cq in this particular case is not right: it should truely
> > > > publish the address and advance the cached_prod in cq as long as it
> > > > reads descriptors from txq.
> > > >
> > > > The following is the full analysis.
> > > > xsk_drop_skb() is called in three places, which all discard a partially
> > > > built multi-buffer skb:
> > > > 1) xsk_build_skb() -EOVERFLOW error path: packet exceeds
> > MAX_SKB_FRAGS
> > > > 2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
> > > > the TX ring prevents the partial packet from completing
> > > > 3) xsk_release(): socket close while xs->skb holds an incomplete packet
> > > >
> > > > In all three cases, the TX descriptors for the already-processed frags
> > > > have been consumed from the TX ring (xskq_cons_release), and CQ slots
> > > > have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
> > > > which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
> > > > the buffer addresses never appear in the completion queue, userspace
> > > > permanently loses track of these buffers.
> > > >
> > > > Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
> > > > destructor, which already submits buffer addresses to the CQ via
> > > > xsk_cq_submit_addr_locked().
> > > >
> > > > Note that cancelling the descriptors back to the TX ring (via
> > > > xskq_cons_cancel_n) is not a appropriate option because an oversized
> > > > packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
> > > > which is an obviously deadlock bug in the TX path.
> > > >
> > > > Also move the desc->addr assignment in xsk_build_skb() above the
> > > > overflow check so that the current descriptor's address is recorded
> > > > before a potential -EOVERFLOW jump to free_err, consistent with the
> > > > zerocopy path in xsk_build_skb_zerocopy().
> > > >
> > > > [1]:
> > https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > >
> > > This change looks good, but overflow case with only 1 descriptor worries
> > me.
> >
> > I presume you referred to xsk_build_skb_zerocopy()?
> >
> > > In such cases, once we get to following code, kfree_skb() has already
> > happened:
> > >
> > > if (err == -EOVERFLOW) {
> > > if (xs->skb) {
> > > /* Drop the packet */
> > > xsk_inc_num_desc(xs->skb);
> > > xsk_drop_skb(xs->skb);
> > > } else {
> > > xsk_cq_cancel_locked(xs->pool, 1);
> > > xs->tx->invalid_descs++;
> > > }
> > > xskq_cons_release(xs->tx);
> > > }
> > >
> > > kfree_skb() should have resulted in submission of the single fat descriptor to
> > > xsk_cq_submit_addr_locked() via xsk_destruct_skb(), so far consistent with
> > the
> >
> > At least, in the NO_LINEAR case, xsk_skb_init_misc() is not called
> > since the OVERFLOW skips this function, which means kfree_skb()
> > doesn't invoke xsk_destruct_skb() to publish it in the CQ. So it's
> > safe to cancel the cq reservation (in xsk_cq_cancel_locked(xs->pool,
> > 1)).
>
> (responding from outlook so apologies for any broken formatting)
>
> Yes, I have the same understanding here. However, how technically
> possible would it be to produce > MAX_SKB_FRAGS from a single
> AF_XDP descriptor?
Very unlikely. But my viewpoint might change after a wide deployment
internally in the second half of the year.
>
> I know Sashiko has pointed this out and you came up with previous
> fix, but for valid descriptor it is simply not possible. And invalid
> descs wouldn't reach this function.
Yep.
>
> I wouldn't like to stir up the pot too much so let us keep this
> code, but is there any way to give Sashiko additional context?
> I mean, for case where we would say *this can't happen*, will
> It be able to carry this information onwards?
I don't know how Sashiko works, sorry. Maybe @Roman Gushchin has
some insight into this?
Thanks,
Jason
>
> >
> > Thanks,
> > Jason
> >
> > > multi-descriptor bevaior you are proposing here.
> > >
> > > But what happens when we cancel a submitted CQ slot via
> > > xsk_cq_cancel_locked(xs->pool, 1) in the above code?
> > >
> > > >
> > > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx
> > path")
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > > net/xdp/xsk.c | 13 ++++++++-----
> > > > 1 file changed, 8 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index b970f30ea9b9..a7a83dc4546a 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -794,8 +794,11 @@ static void xsk_consume_skb(struct sk_buff
> > *skb)
> > > >
> > > > static void xsk_drop_skb(struct sk_buff *skb)
> > > > {
> > > > - xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
> > > > - xsk_consume_skb(skb);
> > > > + struct xdp_sock *xs = xdp_sk(skb->sk);
> > > > +
> > > > + xs->tx->invalid_descs += xsk_get_num_desc(skb);
> > > > + consume_skb(skb);
> > > > + xs->skb = NULL;
> > > > }
> > > >
> > > > static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> > > > @@ -877,7 +880,7 @@ static struct sk_buff
> > *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > return ERR_PTR(-ENOMEM);
> > > >
> > > > /* in case of -EOVERFLOW that could happen below,
> > > > - * xsk_consume_skb() will release this node as whole skb
> > > > + * xsk_drop_skb() will release this node as whole skb
> > > > * would be dropped, which implies freeing all list elements
> > > > */
> > > > xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> > > > @@ -969,6 +972,8 @@ static struct sk_buff *xsk_build_skb(struct
> > xdp_sock *xs,
> > > > goto free_err;
> > > > }
> > > >
> > > > + xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> > > > +
> > > > if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) &&
> > xp_mb_desc(desc))) {
> > > > err = -EOVERFLOW;
> > > > goto free_err;
> > > > @@ -986,8 +991,6 @@ static struct sk_buff *xsk_build_skb(struct
> > xdp_sock *xs,
> > > >
> > > > skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
> > > > refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
> > > > -
> > > > - xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> > > > }
> > > > }
> > > >
> > > > --
> > > > 2.43.0
> > > >
> > > >
* Re: [PATCH v4] virtio_net: disable cb when NAPI is busy-polled
From: Jakub Kicinski @ 2026-06-27 0:44 UTC
To: Simon Horman, lange_tang
Cc: mst, xuanzhuo, jasowang, edumazet, virtualization, netdev,
tanglongjun
In-Reply-To: <20260626151508.1319440-1-horms@kernel.org>
On Fri, 26 Jun 2026 16:15:08 +0100 Simon Horman wrote:
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 26afa6341d161..c1e252400c0fc 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3011,6 +3011,8 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
> > unsigned int xdp_xmit = 0;
> > bool napi_complete;
> >
> > + virtqueue_disable_cb(rq->vq);
> > +
>
> [Severity: High]
> Can this unconditionally disable the RX callback and cause a permanent network
> stall when polled by netpoll?
Good catch. Longjun, just add if (budget).
* Re: [PATCH net-next v3 0/3] net: pse-pd: decouple controller lookup from MDIO probe
From: Jakub Kicinski @ 2026-06-27 0:46 UTC
To: Carlo Szelinsky
Cc: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
Russell King, David S . Miller, Eric Dumazet, Paolo Abeni,
Corey Leavitt, Jonas Jelonek, Simon Horman, netdev, linux-kernel
In-Reply-To: <20260626165929.2908782-1-github@szelinsky.de>
On Fri, 26 Jun 2026 18:59:26 +0200 Carlo Szelinsky wrote:
> Subject: [PATCH net-next v3 0/3] net: pse-pd: decouple controller lookup from MDIO probe
## Form letter - net-next-closed
We have already submitted our pull request with net-next material for v7.2,
and therefore net-next is closed for new drivers, features, code refactoring
and optimizations. We are currently accepting bug fixes only.
Please repost when net-next reopens after June 29th.
RFC patches sent for review only are obviously welcome at any time.
See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
--
pw-bot: defer
pv-bot: closed
* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Jason Xing @ 2026-06-27 0:47 UTC
To: Stanislav Fomichev
Cc: Maciej Fijalkowski, netdev, bpf, magnus.karlsson, stfomichev,
kuba, pabeni, horms, bjorn
In-Reply-To: <aj6mr-cIcF2Tg73r@devvm7509.cco0.facebook.com>
On Sat, Jun 27, 2026 at 12:30 AM Stanislav Fomichev
<sdf.kernel@gmail.com> wrote:
>
> On 06/26, Jason Xing wrote:
> > On Fri, Jun 26, 2026 at 12:05 AM Stanislav Fomichev
> > <sdf.kernel@gmail.com> wrote:
> > >
> > > On 06/25, Jason Xing wrote:
> > > > On Thu, Jun 25, 2026 at 12:37 AM Maciej Fijalkowski
> > > > <maciej.fijalkowski@intel.com> wrote:
> > > > >
> > > > > On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:
> > > > > > On 06/23, Maciej Fijalkowski wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> > > > > > > consumed from the Tx ring are not consistently returned to userspace
> > > > > > > through the completion ring when the packet is later dropped as invalid.
> > > > > > >
> > > > > > > The affected cases are invalid or oversized multi-buffer Tx packets in
> > > > > > > both the generic and zero-copy paths. In these cases, the kernel can
> > > > > > > consume one or more Tx descriptors while building or validating a
> > > > > > > multi-buffer packet, then drop the packet before it reaches the device.
> > > > > > > Userspace still owns the UMEM buffers only after the corresponding
> > > > > > > addresses are returned through the CQ. Missing completions therefore
> > > > > > > make userspace lose track of those buffers.
> > > > > > >
> > > > > > > The generic path fixes cover three related cases:
> > > > > > > * partially built multi-buffer skbs dropped by xsk_drop_skb();
> > > > > > > continuation descriptors left in the Tx ring after xsk_build_skb()
> > > > > > > reports overflow;
> > > > > > > * invalid descriptors encountered in the middle of a multi-buffer
> > > > > > > packet, including the offending invalid descriptor itself.
> > > > > > >
> > > > > > > The zero-copy path is handled separately. The batched Tx parser now
> > > > > > > distinguishes descriptors that can be passed to the driver from
> > > > > > > descriptors that are consumed only because they belong to an invalid
> > > > > > > multi-buffer packet. Reclaim-only descriptors are written to the CQ
> > > > > > > address area and published in completion order, after any earlier
> > > > > > > driver-visible Tx descriptors.
> > > > > > >
> > > > > > > The ZC batching path can also retain drain state when userspace has not
> > > > > > > yet provided the end of an invalid multi-buffer packet. To keep this
> > > > > > > state local to the singular batched path, the series prevents a second
> > > > > > > Tx socket from joining the same pool while such drain state exists.
> > > > > > > During the singular-to-shared transition, Tx batching is gated,
> > > > > > > pre-existing readers are waited out, and bind fails with -EAGAIN if the
> > > > > > > existing socket still has pending drain state. This avoids adding
> > > > > > > multi-buffer drain handling to the shared-UMEM fallback path.
> > > > > > >
> > > > > > > The last two patches update xskxceiver so the tests account invalid
> > > > > > > multi-buffer Tx packets as descriptors that must be reclaimed, while
> > > > > > > still not expecting those invalid packets on the Rx side.
> > > > > > >
> > > > > > > This is a follow-up to Jason's changes [0] which were addressing generic
> > > > > > > xmit only and this set allows me to pass full xskxceiver test suite run
> > > > > > > against ice driver.
> > > > > >
> > > > > > There is a fair amount of feedback from sashiko already :-( So the meta
> > > > > > question from me is: is it time to scrap our current approach where
> > > > > > we parse descriptor by descriptor? (and maintain half-baked skb and
> > > > > > half-consumed descriptor queues)
> > > > > >
> > > > > > Should we:
> > > > > >
> > > > > > 1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
> > > > > > PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
> > > > > > and do a full stop here)
> > > > > > 2. now that we really know the number of valid descriptors -> reserve
> > > > > > the cq space (if not -> EAGAIN)
> > > > > > 3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
> > > > > > locally, don't ever create semi-initialized skb)
> > > > > > 4. construct the skb
> > > > > > 5. xmit
> > > > >
> > > > > Yeah generic xmit became utterly horrible, haven't gone through sashiko
> > > > > > reviews yet, but bear in mind this set also aligns zc side to what was
> > > > > previously being addressed by Jason.
> > > > >
> > > > > I believe planned logistics were to get these fixes onto net and then
> > > > > Jason had an implementation of batching on generic xmit, directed towards
> > > > > -next and that's where we could address current flow.
> > > >
> > > > Agreed. That's what I'm hoping for. There would be much more
> > > > discussion on how to do batch xmit in an elegant way, I believe.
> > >
> > > This doesn't have to depend on the batch rewrite, we should be able to rewrite
> > > this non-zc in net, this is still technically fixes, not feature work..
> > >
> > > There was already a couple of revisions with this drain_cont approach
> > > > and every time I look at it, it feels like the cure is worse than the
> > > > disease :-( Obviously not gonna stop you from going with the current approach,
> > > but these fixes feel a bit of a wasted effort to me (since the bugs keep
> > > coming and we are piling more complexity).
> >
> > I see your point, but rewriting is something that cannot be easily
> > applied to the stable branches? Until now, we fix issues one by one
> > which have an explicit target branch (because of the fixes tag). Cross
> > fingers :(
> >
> > Sashiko has the magic to find out the hidden bugs more than ever and
> > AF_XDP is not the only place where a pile of reports are coming in.
>
> net vs net-next is fixes vs feature work. If we can't fix the current
> code, I think we can justify a rewrite using a better approach and
> route it via net. This series is 7 patches anyway, it's not like
> it is a quick short fix :-) But I'm ok with pushing it as is, I'm just
> trying to see if someone on your side is fed up with that part as well
> and wants to fix it "properly" :-p
>
> > My take is that batch xmit has been pending too long and at least so
> > far fewer and fewer bugs are found by sashiko. I believe if the mode is
> > changed to batch xmit, there are likely to be new and challenging
> > problems to discuss. I prefer to solve questions of the batch xmit
> > series.
>
> We can redo this part separately, without batching. Move from "read
> one chunk at a time" to "pre-read all chunks". Batching vs current issue
> are separate.
If the implementation of 'pre-read' is clear and simple, then yes, it's
the better way. (I would really like to work on this right now, but
sorry, I can't, since I'm writing many slides for Netdev.)
Probably Maciej will give it one last try; we'll see then.
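For illustration only, here is a rough userspace sketch of the "pre-read
all chunks" idea from the numbered list quoted earlier in this thread:
scan descriptors until the continuation flag clears, learn the count,
and only then reserve completion-queue space. Every name in it
(demo_desc, DEMO_PKT_CONT, DEMO_MAX_FRAGS, demo_preread, demo_xmit_one)
is hypothetical and not a kernel API; the error codes simply follow the
quoted proposal.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PKT_CONT  0x1u	/* hypothetical "more fragments follow" flag */
#define DEMO_MAX_FRAGS 4	/* stand-in for MAX_SKB_FRAGS */

struct demo_desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

/*
 * Peek at ring[0..avail) without consuming anything and copy one complete
 * packet into out[]. Returns the number of descriptors in the packet, 0 if
 * the ring is empty, or -EOVERFLOW if the packet is not terminated within
 * DEMO_MAX_FRAGS descriptors or within what has been published so far.
 */
static int demo_preread(const struct demo_desc *ring, size_t avail,
			struct demo_desc out[DEMO_MAX_FRAGS])
{
	size_t n;

	for (n = 0; n < avail && n < DEMO_MAX_FRAGS; n++) {
		out[n] = ring[n];
		if (!(ring[n].options & DEMO_PKT_CONT))
			return (int)(n + 1);	/* end of packet found */
	}
	return n ? -EOVERFLOW : 0;
}

/*
 * Steps 1-2 of the quoted proposal: learn the descriptor count first, then
 * reserve completion-queue space for exactly that many entries. Nothing is
 * consumed on failure, so no half-built state is left behind.
 */
static int demo_xmit_one(const struct demo_desc *ring, size_t avail,
			 size_t *cq_free)
{
	struct demo_desc pkt[DEMO_MAX_FRAGS];
	int n = demo_preread(ring, avail, pkt);

	if (n <= 0)
		return n;
	if ((size_t)n > *cq_free)
		return -EAGAIN;
	*cq_free -= (size_t)n;
	return n;	/* a real caller would now build and send the packet */
}
```

The point of the shape is that every failure happens before any state is
committed, which is what would make a separate drain path unnecessary in
this toy model. Whether that carries over to the real rings is exactly
the open question in this thread.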
>
> > BTW, would you both come to Netdev 0x1a next month? I believe we could
> > sit around the table and discuss some future plans there (in xdp
> > workshop?).
> > https://netdevconf.info/0x1A/sessions/workshop/xdp-workshop.html
>
> Yes, I plan to be there in person.
Great.
Thanks,
Jason
^ permalink raw reply
* Re: [PATCH net v2] octeontx2-pf: check DMAC extraction support before filtering
From: Harshitha Ramamurthy @ 2026-06-27 0:50 UTC (permalink / raw)
To: nshettyj
Cc: netdev, linux-kernel, sgoutham, gakula, sbhatta, hkelam,
bbhushan2, andrew+netdev, davem, edumazet, kuba, pabeni, naveenm,
tduszynski, sumang
In-Reply-To: <20260626062329.871990-1-nshettyj@marvell.com>
On Thu, Jun 25, 2026 at 11:24 PM <nshettyj@marvell.com> wrote:
>
> From: Suman Ghosh <sumang@marvell.com>
>
> Currently, configuring a VF MAC address via the PF (e.g., 'ip link
> set <pf> vf 0 mac <mac>') blindly attempts to install a DMAC-based
> hardware filter. However, the hardware parser profile might not
> support DMAC extraction.
>
> Check if the hardware parsing profile supports DMAC extraction
> before adding the filter. Additionally, emit a warning message
> to inform the operator if the MAC filter installation fails due
> to missing DMAC extraction support.
>
> Fixes: f0c2982aaf98 ("octeontx2-pf: Add support for SR-IOV management functions")
> Signed-off-by: Suman Ghosh <sumang@marvell.com>
> Signed-off-by: Nitin Shetty J <nshettyj@marvell.com>
>
> ---
> v2:
> - Move the DMAC extraction check from otx2_set_vf_mac() into
> otx2_do_set_vf_mac() which already holds pf->mbox.lock, so all
> mbox operations are under a single lock/unlock pair. All error
> paths now use the existing goto-out pattern, eliminating the
> scattered mutex_unlock() + return calls from v1.
> - Return -EOPNOTSUPP instead of 0 when DMAC extraction is not
> supported, so the caller gets an explicit error rather than a
> silent success.
Please ensure a minimum 24-hour gap before posting a new revision, and
also don't post patches in reply to a previous posting, as documented
in:
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html
> ---
> .../ethernet/marvell/octeontx2/nic/otx2_pf.c | 33 +++++++++++++++++++
> 1 file changed, 33 insertions(+)
>
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> index b63df5737ff2..dc7e4a225dd0 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> @@ -2517,10 +2517,43 @@ EXPORT_SYMBOL(otx2_config_hwtstamp_set);
>
> static int otx2_do_set_vf_mac(struct otx2_nic *pf, int vf, const u8 *mac)
> {
> + struct npc_get_field_status_req *freq;
> + struct npc_get_field_status_rsp *frsp;
> struct npc_install_flow_req *req;
> int err;
>
> mutex_lock(&pf->mbox.lock);
> +
> + /* Skip installing the DMAC filter if the hardware parser profile
> + * does not support DMAC extraction.
> + */
> + freq = otx2_mbox_alloc_msg_npc_get_field_status(&pf->mbox);
> + if (!freq) {
> + err = -ENOMEM;
> + goto out;
> + }
I noticed that otx2_set_vf_mac() copies the MAC address into the vf
config structure before the programming is successful. Is that
intended?
> +
> + freq->field = NPC_DMAC;
> + if (otx2_sync_mbox_msg(&pf->mbox)) {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + frsp = (struct npc_get_field_status_rsp *)otx2_mbox_get_rsp
> + (&pf->mbox.mbox, 0, &freq->hdr);
> + if (IS_ERR(frsp)) {
> + err = PTR_ERR(frsp);
> + goto out;
> + }
> +
> + if (!frsp->enable) {
> + netdev_warn(pf->netdev,
> + "VF %d MAC filter not installed: DMAC extraction not supported by parser profile\n",
> + vf);
Would a netdev_warn_ratelimited() be better here to avoid spamming the log?
> + err = -EOPNOTSUPP;
> + goto out;
> + }
> +
> req = otx2_mbox_alloc_msg_npc_install_flow(&pf->mbox);
> if (!req) {
> err = -ENOMEM;
> --
> 2.48.1
>
>
^ permalink raw reply
* Re: [PATCH] MAINTAINERS: Update Jason Wang's email address
From: Jakub Kicinski @ 2026-06-27 1:04 UTC (permalink / raw)
To: Jason Wang; +Cc: mst, virtualization, netdev, eperezma, kvm, linux-kernel
In-Reply-To: <20260626022039.96139-1-jasowang@redhat.com>
On Fri, 26 Jun 2026 10:20:38 +0800 Jason Wang wrote:
> I will use jasowangio@gmail.com for future review and discussion
Do you want to add a mailmap entry, too?
Otherwise I think you'll get CCed twice (once for MAINTAINERS and once
because you have given tags to previous changes)
^ permalink raw reply
* Re: [PATCH net-next] caif: annotate phyinfo lookup under config lock
From: Jakub Kicinski @ 2026-06-27 1:07 UTC (permalink / raw)
To: Runyu Xiao
Cc: davem, edumazet, pabeni, horms, netdev, linux-kernel, jianhao.xu
In-Reply-To: <20260626042440.2013499-1-runyu.xiao@seu.edu.cn>
On Fri, 26 Jun 2026 12:24:40 +0800 Runyu Xiao wrote:
> cfcnfg_get_phyinfo_rcu() is used by both RCU read-side paths and config
> update paths that hold cnfg->lock before adding or deleting entries from
> cnfg->phys. The helper walks the list with list_for_each_entry_rcu(),
> but does not tell lockdep about the config-lock-protected callers.
>
> Pass lockdep_is_held(&cnfg->lock) to the iterator. RCU-reader callers
> remain valid, and CONFIG_PROVE_RCU_LIST can now see the non-RCU
> protection used by the add/delete paths.
>
> This was found by our static analysis tool and then manually reviewed
> against the current tree. The dynamic triage evidence is a
> target-matched CONFIG_PROVE_RCU_LIST warning; the change is limited
> to documenting the existing protection contract.
This code was removed a couple of releases ago.
^ permalink raw reply
* Re: [PATCH] selftests: Open /dev/udmabuf O_RDONLY
From: Jakub Kicinski @ 2026-06-27 1:09 UTC (permalink / raw)
To: T.J. Mercier
Cc: kraxel, vivek.kasireddy, Shuah Khan, Andrew Lunn, David S. Miller,
Eric Dumazet, Paolo Abeni, linux-kselftest, linux-kernel, netdev,
bpf
In-Reply-To: <20260625181557.1086105-1-tjmercier@google.com>
On Thu, 25 Jun 2026 11:15:55 -0700 T.J. Mercier wrote:
> Write permissions on the /dev/udmabuf device file are not required to
> issue ioctls and allocate udmabufs. Applications should be opening this
> file as O_RDONLY. The BPF dmabuf_iter selftest already does this. [1]
>
> Remove the write access mode from the drivers/dma-buf/udmabuf.c and
> drivers/net/hw/ncdevmem.c selftests.
You need to explain "why", too. Why change it if it clearly
worked for everyone running this test until now.
--
pw-bot: cr
^ permalink raw reply
* Re: [PATCH v2] netdevsim: fix use-after-free in nsim_create and __nsim_dev_port_del
From: Jakub Kicinski @ 2026-06-27 1:48 UTC (permalink / raw)
To: Hrushiraj Gandhi
Cc: Simon Horman, Andrew Lunn, David S . Miller, Eric Dumazet,
Paolo Abeni, Jiri Pirko, netdev, linux-kernel, bpf,
syzbot+6c25f4750230faf70be9
In-Reply-To: <20260623144447.255326-1-hrushirajg23@gmail.com>
On Tue, 23 Jun 2026 20:14:47 +0530 Hrushiraj Gandhi wrote:
> Fix both paths by calling debugfs_remove_recursive() on the port's
> ddir before every free_netdev() call. The subsequent
> nsim_dev_port_debugfs_exit() calls become harmless no-ops since ddir is
> set to NULL.
Looks like the wrong fix. All features clean up after themselves with
the exception of ethtool. Save the ethtool ddir and remove just that
one. This will align with how the other features behave.
--
pw-bot: cr
^ permalink raw reply
* [PATCH net] net/smc: fix UAF in smc_cdc_rx_handler() by pinning the socket
From: Xiang Mei @ 2026-06-27 1:49 UTC (permalink / raw)
To: D . Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
Mahanta Jambigi, Tony Lu, Wen Gu, netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Hans Wippel, linux-rdma, linux-s390, Weiming Shi,
Xiang Mei
smc_cdc_rx_handler() looks up the connection by token under the link
group's conns_lock, drops the lock, and then dereferences conn and the
smc_sock derived from it, ending in sock_hold(&smc->sk) inside
smc_cdc_msg_recv(). No reference is held across the lock release.
The only reference pinning the socket while the connection is
discoverable in the link group is taken in smc_lgr_register_conn()
(sock_hold) and dropped in __smc_lgr_unregister_conn() (sock_put), both
under conns_lock. Once the handler drops conns_lock, a concurrent
close() -> smc_release() -> smc_conn_free() -> smc_lgr_unregister_conn()
can drop that reference and free the smc_sock, so the handler's later
sock_hold() runs on freed memory:
WARNING: lib/refcount.c:25 at refcount_warn_saturate
Workqueue: rxe_wq do_work
refcount_warn_saturate (lib/refcount.c:25)
smc_cdc_msg_recv (net/smc/smc_cdc.c:430)
smc_cdc_rx_handler (net/smc/smc_cdc.c:502)
smc_wr_rx_tasklet_fn (net/smc/smc_wr.c:445)
tasklet_action_common (kernel/softirq.c:938)
handle_softirqs (kernel/softirq.c:622)
Kernel panic - not syncing: panic_on_warn set
Only SMC-R is affected. The SMC-D receive tasklet is stopped by
tasklet_kill(&conn->rx_tsklet) in smc_conn_free() before the connection
is unregistered, so it cannot run concurrently with the free.
Take the socket reference while still holding conns_lock, so the
registration reference can no longer be the last one, and drop it once
the handler is done.
Fixes: d7b0e37c1ac1 ("net/smc: restructure CDC message reception")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
net/smc/smc_cdc.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 619b3bab3824..b809139d7e87 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -483,21 +483,27 @@ static void smc_cdc_rx_handler(struct ib_wc *wc, void *buf)
lgr = smc_get_lgr(link);
read_lock_bh(&lgr->conns_lock);
conn = smc_lgr_find_conn(ntohl(cdc->token), lgr);
+ if (conn && !conn->out_of_sync)
+ sock_hold(&container_of(conn, struct smc_sock, conn)->sk);
+ else
+ conn = NULL;
read_unlock_bh(&lgr->conns_lock);
- if (!conn || conn->out_of_sync)
+ if (!conn)
return;
smc = container_of(conn, struct smc_sock, conn);
if (cdc->prod_flags.failover_validation) {
smc_cdc_msg_validate(smc, cdc, link);
- return;
+ goto out;
}
if (smc_cdc_before(ntohs(cdc->seqno),
conn->local_rx_ctrl.seqno))
/* received seqno is old */
- return;
+ goto out;
smc_cdc_msg_recv(smc, cdc);
+out:
+ sock_put(&smc->sk);
}
static struct smc_wr_rx_handler smc_cdc_rx_handlers[] = {
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net v2] net: ipa: fix SMEM state handle leaks in SMP2P init
From: patchwork-bot+netdevbpf @ 2026-06-27 1:50 UTC (permalink / raw)
To: haoxiang_li2024
Cc: elder, andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
linux-kernel, stable
In-Reply-To: <20260624065955.2822765-1-haoxiang_li2024@163.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 24 Jun 2026 14:59:55 +0800 you wrote:
> ipa_smp2p_init() acquires two Qualcomm SMEM state handles with
> qcom_smem_state_get(). However, neither the init error paths
> nor ipa_smp2p_exit() release them.
>
> Release both handles with qcom_smem_state_put() in the init
> error paths and in ipa_smp2p_exit().
>
> [...]
Here is the summary with links:
- [net,v2] net: ipa: fix SMEM state handle leaks in SMP2P init
https://git.kernel.org/netdev/net/c/96ca1e658ae4
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v2] net: liquidio: fix BAR resource leak on PF number failure
From: patchwork-bot+netdevbpf @ 2026-06-27 1:50 UTC (permalink / raw)
To: haoxiang_li2024
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, ricardo.farrington,
felix.manlunas, horms, netdev, linux-kernel, stable
In-Reply-To: <20260624064013.2809570-1-haoxiang_li2024@163.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 24 Jun 2026 14:40:13 +0800 you wrote:
> If cn23xx_get_pf_num() fails, the function returns without
> unmapping either BAR. Unmap both BARs before returning from
> the error path.
>
> Found by manual code review.
>
> Fixes: 0c45d7fe12c7 ("liquidio: fix use of pf in pass-through mode in a virtual machine")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
>
> [...]
Here is the summary with links:
- [net,v2] net: liquidio: fix BAR resource leak on PF number failure
https://git.kernel.org/netdev/net/c/c63ee62a3c4a
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v2] net: pse-pd: scope pse_control regulator handle to kref lifetime
From: patchwork-bot+netdevbpf @ 2026-06-27 1:50 UTC (permalink / raw)
To: Carlo Szelinsky
Cc: o.rempel, kory.maincent, andrew+netdev, davem, edumazet, kuba,
pabeni, horms, corey, hkallweit1, linux, netdev, linux-kernel
In-Reply-To: <20260624204017.2752934-1-github@szelinsky.de>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 24 Jun 2026 22:40:16 +0200 you wrote:
> From: Corey Leavitt <corey@leavitt.info>
>
> __pse_control_release() drops psec->ps via devm_regulator_put(), which
> only succeeds if the devres entry added by the matching
> devm_regulator_get_exclusive() is still present on pcdev->dev at the
> time the pse_control's kref hits zero.
>
> [...]
Here is the summary with links:
- [net,v2] net: pse-pd: scope pse_control regulator handle to kref lifetime
https://git.kernel.org/netdev/net/c/16759757c4d2
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net-next v1] tcp/dccp: avoid parity split for socket-local bind range
From: luoxuanqiang @ 2026-06-27 1:59 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Eric Dumazet, Neal Cardwell, netdev, David S . Miller,
Jakub Kicinski, Paolo Abeni, Simon Horman, luoxuanqiang
In-Reply-To: <CAAVpQUCayy3o59i2vh9hHRPi-3pw1BJgEYMwZYRpZnYEUoqsGw@mail.gmail.com>
> On Jun 27, 2026, at 07:40, Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Fri, Jun 26, 2026 at 2:40 AM <xuanqiang.luo@linux.dev> wrote:
>>
>> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
>>
>> IP_LOCAL_PORT_RANGE lets applications override the netns ephemeral port
>> range on a per-socket basis. __inet_hash_connect() already treats such a
>> range as an explicit application partition and scans it with step 1 [1].
>>
>> Do the same in inet_csk_find_open_port():
>
> What's the use case of IP_LOCAL_PORT_RANGE + bind(, 0)
> without IP_BIND_ADDRESS_NO_PORT ?
Hi Kuniyuki,
Thanks for the question!
The use case is when an application wants to restrict ephemeral port
allocation to a socket-local IP_LOCAL_PORT_RANGE, but still needs
bind(..., 0) to allocate and reserve a local port immediately.
IP_BIND_ADDRESS_NO_PORT is useful when the application can defer port
allocation until connect(), but it changes this behavior: bind(..., 0)
does not reserve a port in that case. So it is not a replacement for
applications that need the local port before connect(), for example to
publish it to another component or set up local policy.
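To make the use case concrete, a minimal userspace sketch of the sequence
described above: set a socket-local range, bind to port 0, and read back
the port the kernel reserved. The helper name demo_bind_in_range and the
range passed to it are made up for illustration; the fallback define
assumes the value from the Linux UAPI header, and the option needs a
kernel that supports IP_LOCAL_PORT_RANGE.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_LOCAL_PORT_RANGE
#define IP_LOCAL_PORT_RANGE 51	/* assumed from the Linux UAPI linux/in.h */
#endif

/*
 * Restrict this socket's ephemeral range to [lo, hi], bind to port 0 and
 * return the port the kernel picked (host byte order), or -1 on any error,
 * for example on kernels without IP_LOCAL_PORT_RANGE.
 */
static int demo_bind_in_range(uint16_t lo, uint16_t hi)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	/* upper bound in the high 16 bits, lower bound in the low 16 bits */
	uint32_t range = ((uint32_t)hi << 16) | lo;
	int port = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* ask the kernel to choose */

	if (setsockopt(fd, IPPROTO_IP, IP_LOCAL_PORT_RANGE,
		       &range, sizeof(range)) == 0 &&
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
	    getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
		port = ntohs(addr.sin_port);

	close(fd);	/* demo only: closing the socket releases the port */
	return port;
}
```

A real application would keep the socket open after getsockname() so the
port stays reserved while it is published to the other component.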
This patch is also intended to keep the bind(..., 0) path consistent with
Eric's earlier change in __inet_hash_connect().
Thanks,
Xuanqiang
^ permalink raw reply
* Re: [PATCH] qede: fix out-of-bounds check for cqe->len_list[]
From: patchwork-bot+netdevbpf @ 2026-06-27 2:00 UTC (permalink / raw)
To: Matvey Kovalev
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, Pavel.Zhigulin,
netdev, linux-kernel, lvc-project
In-Reply-To: <20260623144602.3521-1-matvey.kovalev@ispras.ru>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 23 Jun 2026 17:45:54 +0300 you wrote:
> Move index check before element access.
>
> Fixes: 896f1a2493b5 ("net: qlogic/qede: fix potential out-of-bounds read in qede_tpa_cont() and qede_tpa_end()")
> Found by Linux Verification Center (linuxtesting.org) with SVACE.
>
> Signed-off-by: Matvey Kovalev <matvey.kovalev@ispras.ru>
>
> [...]
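For readers without the diff at hand, a small sketch of what "move index
check before element access" means in general. It assumes a loop that
walks a fixed-size length list until it hits a zero entry; the names
demo_sum_lens and DEMO_LEN_LIST_SIZE are hypothetical and this is not the
qede code itself.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_LEN_LIST_SIZE 4

/* Sum the leading non-zero entries of a fixed-size length list. */
static unsigned int demo_sum_lens(const uint16_t len_list[DEMO_LEN_LIST_SIZE])
{
	unsigned int total = 0;
	size_t i;

	/*
	 * The bound is tested first. With the operands swapped
	 * (len_list[i] && i < SIZE), a list with no zero terminator would
	 * be read one element past its end before the bound is noticed.
	 */
	for (i = 0; i < DEMO_LEN_LIST_SIZE && len_list[i]; i++)
		total += len_list[i];

	return total;
}
```

Because && short-circuits left to right, the order of the two operands is
the whole fix in this model.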
Here is the summary with links:
- qede: fix out-of-bounds check for cqe->len_list[]
https://git.kernel.org/netdev/net/c/f9ba47fce593
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH 0/2] net/sched: finish the qdisc_dequeue_peeked conversion (taprio, multiq)
From: patchwork-bot+netdevbpf @ 2026-06-27 2:00 UTC (permalink / raw)
To: Bryam Vargas
Cc: vinicius.gomes, pabeni, jhs, jiri, kuba, davem, edumazet, horms,
netdev, jarkao2, vladimir.oltean, linux-kernel
In-Reply-To: <20260625-b4-disp-31bcb279-v1-0-85c40b83c529@proton.me>
Hello:
This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 25 Jun 2026 04:51:18 -0500 you wrote:
> Commit 77be155cba4e added peek emulation: a non-work-conserving qdisc's
> ->peek dequeues one skb and stashes it in the child's gso_skb. A parent
> that peeks such a child must then take the packet with
> qdisc_dequeue_peeked(), not a direct ->dequeue(), or the stashed skb is
> bypassed and the child's qlen/backlog desync. sch_red and sch_sfb were
> just fixed for this; taprio and multiq still take the direct path.
>
> [...]
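A toy model of the peek emulation described in the cover letter may help
here. Packets are plain integers, stash stands in for the child's
gso_skb, and all demo_* names are hypothetical; the accounting follows
the description above, not any particular kernel version.

```c
#include <stddef.h>

struct demo_qdisc {
	int queue[8];	/* toy packets, identified by a non-zero integer */
	size_t head, tail;
	int stash;	/* stands in for gso_skb; 0 means empty */
	size_t qlen;	/* packets owned by this qdisc, stash included */
};

/* Raw ->dequeue(): knows nothing about the stash. */
static int demo_dequeue(struct demo_qdisc *q)
{
	if (q->head == q->tail)
		return 0;
	q->qlen--;
	return q->queue[q->head++];
}

/* Peek emulation: dequeue one packet and park it in the stash. */
static int demo_peek(struct demo_qdisc *q)
{
	if (!q->stash) {
		q->stash = demo_dequeue(q);
		if (q->stash)
			q->qlen++;	/* still accounted to this qdisc */
	}
	return q->stash;
}

/* What a parent must call after peeking: drain the stash first. */
static int demo_dequeue_peeked(struct demo_qdisc *q)
{
	int pkt = q->stash;

	if (pkt) {
		q->stash = 0;
		q->qlen--;
		return pkt;
	}
	return demo_dequeue(q);
}
```

In this model, calling demo_dequeue() right after demo_peek() hands the
parent a different packet than the one it peeked and leaves the peeked
one parked in the stash, which is the bypass the series removes from
taprio and multiq.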
Here is the summary with links:
- [1/2] net/sched: sch_taprio: Replace direct dequeue call with peek and qdisc_dequeue_peeked
https://git.kernel.org/netdev/net/c/e056e1dfcddc
- [2/2] net/sched: sch_multiq: Replace direct dequeue call with peek and qdisc_dequeue_peeked
https://git.kernel.org/netdev/net/c/54f6b0c843e2
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v2] seg6: validate SRH length before reading fixed fields
From: patchwork-bot+netdevbpf @ 2026-06-27 2:00 UTC (permalink / raw)
To: Nuoqi Gui
Cc: davem, edumazet, kuba, pabeni, horms, andrea.mayer, netdev, bpf,
linux-kernel, m.xhonneux, daniel, dlebrun
In-Reply-To: <20260623-f01-17-seg6-srh-len-v2-1-2edc40e9e3e1@mails.tsinghua.edu.cn>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 23 Jun 2026 18:32:31 +0800 you wrote:
> seg6_validate_srh() reads fixed SRH fields such as srh->type and
> srh->hdrlen before checking that the supplied length covers the fixed
> struct ipv6_sr_hdr fields.
>
> The BPF SEG6 encap path reaches this with a BPF program-supplied pointer
> and length: bpf_lwt_push_encap() and the SEG6 local BPF END_B6 and
> END_B6_ENCAP actions call bpf_push_seg6_encap(), which forwards the
> length to seg6_validate_srh() with no minimum-size guard. A 2-byte SEG6
> encap header can therefore make the validator read srh->type at offset 2
> beyond the caller-supplied buffer.
>
> [...]
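The ordering problem described above can be shown with a small userspace
sketch: check that the supplied length covers the fixed header before any
field is read. The struct is modelled on struct ipv6_sr_hdr, but
demo_srh, DEMO_SRH_TYPE and demo_validate_srh are hypothetical names, and
the real validator performs more checks than are shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_srh {		/* fixed part only, 8 bytes */
	uint8_t nexthdr;
	uint8_t hdrlen;		/* in 8-octet units, not counting the first 8 */
	uint8_t type;
	uint8_t segments_left;
	uint8_t first_segment;
	uint8_t flags;
	uint16_t tag;
};

#define DEMO_SRH_TYPE 4		/* routing header type 4: segment routing */

/* Returns 1 if buf[0..len) looks like a well-sized SRH, 0 otherwise. */
static int demo_validate_srh(const void *buf, size_t len)
{
	struct demo_srh srh;

	/* Length guard first: a 2-byte buffer must never reach srh.type. */
	if (len < sizeof(srh))
		return 0;

	/* Copy out the fixed header so alignment of buf does not matter. */
	memcpy(&srh, buf, sizeof(srh));

	if (srh.type != DEMO_SRH_TYPE)
		return 0;
	if (((size_t)srh.hdrlen + 1) * 8 != len)
		return 0;

	return 1;
}
```

With the guard in place, the 2-byte header from the commit message is
rejected before the type field at offset 2 is touched.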
Here is the summary with links:
- [net,v2] seg6: validate SRH length before reading fixed fields
https://git.kernel.org/netdev/net/c/a75d99f46bf2
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply