Netdev List

* [net PATCH v2] octeontx2-af: mcs: Fix unsupported secy stats read
From: Subbaraya Sundeep @ 2026-06-16 19:00 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, sgoutham, gakula,
	bbhushan2, rkannoth
  Cc: netdev, linux-kernel, Subbaraya Sundeep

From: Geetha sowjanya <gakula@marvell.com>

The SecY control stats counter does not exist on the CNF10KB
platform. Skip reading this register on CNF10KB silicon when
fetching SecY stats.

Fixes: 9312150af8da ("octeontx2-af: cn10k: mcs: Support for stats collection")
Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
---
v2 changes:
 Addressed AI review feedback: debugfs no longer reads the SecY
 control stats counter on CNF10KB either.

 drivers/net/ethernet/marvell/octeontx2/af/mcs.c         | 6 +++---
 drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c | 3 ++-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mcs.c b/drivers/net/ethernet/marvell/octeontx2/af/mcs.c
index c1775bd01c2b..a07e0b3d8d00 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mcs.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mcs.c
@@ -120,13 +120,13 @@ void mcs_get_rx_secy_stats(struct mcs *mcs, struct mcs_secy_stats *stats, int id
 	reg = MCSX_CSE_RX_MEM_SLAVE_INPKTSSECYUNTAGGEDX(id);
 	stats->pkt_untaged_cnt = mcs_reg_read(mcs, reg);
 
-	reg = MCSX_CSE_RX_MEM_SLAVE_INPKTSSECYCTLX(id);
-	stats->pkt_ctl_cnt = mcs_reg_read(mcs, reg);
-
 	if (mcs->hw->mcs_blks > 1) {
 		reg = MCSX_CSE_RX_MEM_SLAVE_INPKTSSECYNOTAGX(id);
 		stats->pkt_notag_cnt = mcs_reg_read(mcs, reg);
+		return;
 	}
+	reg = MCSX_CSE_RX_MEM_SLAVE_INPKTSSECYCTLX(id);
+	stats->pkt_ctl_cnt = mcs_reg_read(mcs, reg);
 }
 
 void mcs_get_flowid_stats(struct mcs *mcs, struct mcs_flowid_stats *stats,
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
index fa461489acdd..ca2704b188a5 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
@@ -482,10 +482,11 @@ static int rvu_dbg_mcs_rx_secy_stats_display(struct seq_file *filp, void *unused
 		seq_printf(filp, "secy%d: Tagged ctrl pkts: %lld\n", secy_id,
 			   stats.pkt_tagged_ctl_cnt);
 		seq_printf(filp, "secy%d: Untaged pkts: %lld\n", secy_id, stats.pkt_untaged_cnt);
-		seq_printf(filp, "secy%d: Ctrl pkts: %lld\n", secy_id, stats.pkt_ctl_cnt);
 		if (mcs->hw->mcs_blks > 1)
 			seq_printf(filp, "secy%d: pkts notag: %lld\n", secy_id,
 				   stats.pkt_notag_cnt);
+		else
+			seq_printf(filp, "secy%d: Ctrl pkts: %lld\n", secy_id, stats.pkt_ctl_cnt);
 	}
 	mutex_unlock(&mcs->stats_lock);
 	return 0;
-- 
2.48.1

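A minimal userspace sketch of the patched read order, assuming (as the diff implies) that `mcs_blks > 1` identifies CNF10KB silicon. `struct fake_mcs`, `fake_reg_read()` and the register enum are hypothetical stand-ins for the driver's MCS state and `mcs_reg_read()`; the sketch only records which counters get read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the MCSX_CSE_RX_MEM_SLAVE_INPKTSSECY*X registers */
enum secy_reg { REG_UNTAGGED, REG_CTL, REG_NOTAG, REG_COUNT };

struct fake_mcs {
	int mcs_blks;			/* > 1 is assumed to mean CNF10KB */
	uint64_t regs[REG_COUNT];	/* fake counter values */
	int reads[REG_COUNT];		/* how often each register was read */
};

struct secy_stats {
	uint64_t pkt_untaged_cnt;
	uint64_t pkt_ctl_cnt;
	uint64_t pkt_notag_cnt;
};

/* Stand-in for mcs_reg_read() that also records the access */
static uint64_t fake_reg_read(struct fake_mcs *mcs, enum secy_reg reg)
{
	assert(reg < REG_COUNT);
	mcs->reads[reg]++;
	return mcs->regs[reg];
}

/* Same branch structure as the tail of the patched mcs_get_rx_secy_stats() */
static void get_rx_secy_stats(struct fake_mcs *mcs, struct secy_stats *stats)
{
	/* zeroed here so fields that are not read stay well defined */
	memset(stats, 0, sizeof(*stats));
	stats->pkt_untaged_cnt = fake_reg_read(mcs, REG_UNTAGGED);

	if (mcs->mcs_blks > 1) {
		/* NOTAG counter only; the CTL counter is never touched */
		stats->pkt_notag_cnt = fake_reg_read(mcs, REG_NOTAG);
		return;
	}
	stats->pkt_ctl_cnt = fake_reg_read(mcs, REG_CTL);
}
```

With `mcs_blks = 2` the CTL register is never read; with `mcs_blks = 1` the NOTAG register is never read, matching the debugfs hunk that prints one line or the other.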

^ permalink raw reply related

* [PATCH] octeontx2-pf: Clear stats of all resources when freeing resources
From: Subbaraya Sundeep @ 2026-06-16 19:00 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, sgoutham, gakula,
	bbhushan2, rkannoth
  Cc: netdev, linux-kernel, Subbaraya Sundeep
In-Reply-To: <1781636420-19816-1-git-send-email-sbhatta@marvell.com>

When all MCS resources mapped to a PF are being freed, clear the
stats of all those resources too.

Fixes: 815debbbf7b5 ("octeontx2-pf: mcs: Clear stats before freeing resource")
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
---
 drivers/net/ethernet/marvell/octeontx2/nic/cn10k_macsec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/cn10k_macsec.c b/drivers/net/ethernet/marvell/octeontx2/nic/cn10k_macsec.c
index 4d3a7f4be962..9524d38f1582 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/cn10k_macsec.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/cn10k_macsec.c
@@ -182,6 +182,7 @@ static void cn10k_mcs_free_rsrc(struct otx2_nic *pfvf, enum mcs_direction dir,
 	clear_req->id = hw_rsrc_id;
 	clear_req->type = type;
 	clear_req->dir = dir;
+	clear_req->all = all;
 
 	req = otx2_mbox_alloc_msg_mcs_free_resources(mbox);
 	if (!req)
-- 
2.48.1

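A hypothetical model of why the missing `all` flag mattered, assuming the AF side clears only the single `id` when `all` is unset and every resource owned by the requesting PF when it is set. `struct fake_af`, `struct clear_req` and `af_clear_stats()` are illustrative names, not the driver's mailbox handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SECY 8	/* hypothetical pool size */

struct fake_af {
	int owner[NUM_SECY];		/* PF owning each SecY, -1 if free */
	uint64_t stats[NUM_SECY];	/* per-SecY counter */
};

/* Hypothetical subset of the clear-stats request fields */
struct clear_req {
	int pcifunc;
	int id;
	bool all;
};

static void af_clear_stats(struct fake_af *af, const struct clear_req *req)
{
	if (!req->all) {
		/* only the one resource named in the request */
		assert(req->id >= 0 && req->id < NUM_SECY);
		af->stats[req->id] = 0;
		return;
	}
	/* every resource mapped to the requesting PF */
	for (int i = 0; i < NUM_SECY; i++)
		if (af->owner[i] == req->pcifunc)
			af->stats[i] = 0;
}
```

Without `all` set, a "free everything" request would leave the counters of every resource except `id` untouched in this model.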

^ permalink raw reply related

* [PATCH 6.1] net: gro: don't merge zcopy skbs
From: Alexander Martyniuk @ 2026-06-16 22:00 UTC (permalink / raw)
  To: stable, Greg Kroah-Hartman
  Cc: Alexander Martyniuk, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Sasha Levin, Sabrina Dubroca,
	Hyunwoo Kim, Pavel Begunkov, netdev, linux-kernel, lvc-project,
	Huzaifa Sidhpurwala, Willem de Bruijn

From: Sabrina Dubroca <sd@queasysnail.net>

commit 4db79a322db8c97f7b73b8a347395ef4d685eb40 upstream.

skb_gro_receive() can currently copy frags between the source and GRO
skb, without checking the zerocopy status, and in particular the
SKBFL_MANAGED_FRAG_REFS flag.

When SKBFL_MANAGED_FRAG_REFS is set, the skb doesn't hold a reference
on the pages in shinfo->frags. Appending those frags to another skb's
frags without fixing up the page refcount can lead to UAF.

When either the last skb in the GRO chain (the one we would append
frags to) or the source skb is zerocopy, don't merge the skbs.

Fixes: 753f1ca4e1e5 ("net: introduce managed frags infrastructure")
Reported-by: Huzaifa Sidhpurwala <huzaifas@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/c3b7f906bbfcbdfd7b4fa9d6c18a438870df85be.1779307748.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Martyniuk <alexevgmart@gmail.com>
---
Backport fix for CVE-2026-46323
 net/core/gro.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/gro.c b/net/core/gro.c
index ea6571c01faa..c5a9733d929a 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -171,6 +171,9 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 	if (p->pp_recycle != skb->pp_recycle)
 		return -ETOOMANYREFS;
 
+	if (skb_zcopy(p) || skb_zcopy(skb))
+		return -ETOOMANYREFS;
+
 	/* pairs with WRITE_ONCE() in netif_set_gro_max_size() */
 	gro_max_size = READ_ONCE(p->dev->gro_max_size);
 
-- 
2.30.2

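A simplified model of the added check. `struct fake_skb` and `fake_gro_receive()` are hypothetical; `zcopy` stands in for `skb_zcopy()` returning non-NULL and `nr_frags` for the frag array. It shows the merge being refused, with no frags moved, whenever either side is zerocopy.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced skb: only what this check looks at */
struct fake_skb {
	bool pp_recycle;
	bool zcopy;	/* stands in for skb_zcopy() != NULL */
	int nr_frags;
};

static int fake_gro_receive(struct fake_skb *p, struct fake_skb *skb)
{
	if (p->pp_recycle != skb->pp_recycle)
		return -ETOOMANYREFS;

	/* the backported check: never move frags into or out of a zcopy skb */
	if (p->zcopy || skb->zcopy)
		return -ETOOMANYREFS;

	/* "merge": frags move from skb to p */
	p->nr_frags += skb->nr_frags;
	skb->nr_frags = 0;
	return 0;
}
```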

^ permalink raw reply related

* [PATCH v1 net-next] ipv4: fib_rule: Move fib4_rules_exit() to ->exit().
From: Kuniyuki Iwashima @ 2026-06-16 19:13 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kuniyuki Iwashima, netdev,
	syzbot+965506b59a2de0b6905c

syzbot reported a use-after-free of net->ipv4.rules_ops. [0]

It can be reproduced with these commands:

  while true; do
  	ip netns add ns1
  	ip -n ns1 link set dev lo up
  	ip -n ns1 address add 192.0.2.1/24 dev lo
  	ip -n ns1 link add name dummy1 up type dummy
  	ip -n ns1 address add 198.51.100.1/24 dev dummy1
  	ip -n ns1 rule add ipproto tcp sport 12345 table 12345
  	ip -n ns1 fou add port 5555 ipproto 47 local 192.0.2.1 peer 198.51.100.2 peer_port 54321
  	ip netns del ns1
  done

The cited commit moved fib4_rules_exit() earlier, to ->exit_rtnl(),
but a kernel socket destroyed later in another subsystem's ->exit()
(fou_exit_net() in the trace below) can still reach __fib_lookup().

I left fib4_rules_exit() in ->exit_rtnl() because fib4_rule_delete()
calls fib_unmerge(), which requires RTNL.

However, when ->delete() is called, ->configure() has already been
called, thus fib_unmerge() in ->delete() has no effect.

Let's remove fib_unmerge() in fib4_rule_delete() and move
fib4_rules_exit() to ->exit().

Many thanks to Ido Schimmel for providing the nice repro very quickly.

Note that we can make fib_rules_ops.delete() return void once
net-next opens.

[0]:
BUG: KASAN: slab-use-after-free in fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
Read of size 8 at addr ffff88804ec4c680 by task kworker/u8:21/12641

CPU: 0 UID: 0 PID: 12641 Comm: kworker/u8:21 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
Workqueue: netns cleanup_net
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
 __fib_lookup+0x106/0x210 net/ipv4/fib_rules.c:96
 ip_route_output_key_hash_rcu+0x294/0x2720 net/ipv4/route.c:2811
 ip_route_output_key_hash+0x18d/0x2a0 net/ipv4/route.c:2702
 __ip_route_output_key include/net/route.h:169 [inline]
 ip_route_output_flow+0x2a/0x150 net/ipv4/route.c:2929
 ip4_datagram_release_cb+0x89d/0xbe0 net/ipv4/datagram.c:118
 release_sock+0x206/0x260 net/core/sock.c:3861
 inet_shutdown+0x2b1/0x390 net/ipv4/af_inet.c:950
 udp_tunnel_sock_release+0x6d/0x80 net/ipv4/udp_tunnel_core.c:197
 fou_release net/ipv4/fou_core.c:562 [inline]
 fou_exit_net+0x17d/0x1f0 net/ipv4/fou_core.c:1230
 ops_exit_list net/core/net_namespace.c:199 [inline]
 ops_undo_list+0x43d/0x8d0 net/core/net_namespace.c:252
 cleanup_net+0x572/0x810 net/core/net_namespace.c:702
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
 worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

Fixes: 759923cf03b0 ("ipv4: fib: Convert fib_net_exit_batch() to ->exit_rtnl().")
Reported-by: syzbot+965506b59a2de0b6905c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a315824.b0403584.28d0ff.0000.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
 net/ipv4/fib_frontend.c | 10 ++++++----
 net/ipv4/fib_rules.c    | 11 ++---------
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c7d1f31650d7..42212970d735 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1612,10 +1612,6 @@ static void ip_fib_net_exit(struct net *net)
 			fib_free_table(tb);
 		}
 	}
-
-#ifdef CONFIG_IP_MULTIPLE_TABLES
-	fib4_rules_exit(net);
-#endif
 }
 
 static int __net_init fib_net_init(struct net *net)
@@ -1652,6 +1648,9 @@ static int __net_init fib_net_init(struct net *net)
 	ip_fib_net_exit(net);
 	rtnl_net_unlock(net);
 
+#ifdef CONFIG_IP_MULTIPLE_TABLES
+	fib4_rules_exit(net);
+#endif
 	kfree(net->ipv4.fib_table_hash);
 	fib4_notifier_exit(net);
 	goto out;
@@ -1671,6 +1670,9 @@ static void __net_exit fib_net_exit_rtnl(struct net *net,
 
 static void __net_exit fib_net_exit(struct net *net)
 {
+#ifdef CONFIG_IP_MULTIPLE_TABLES
+	fib4_rules_exit(net);
+#endif
 	kfree(net->ipv4.fib_table_hash);
 	fib4_notifier_exit(net);
 	fib4_semantics_exit(net);
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 51f0193092f0..e068a5bace73 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -352,24 +352,17 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 static int fib4_rule_delete(struct fib_rule *rule)
 {
 	struct net *net = rule->fr_net;
-	int err;
-
-	/* split local/main if they are not already split */
-	err = fib_unmerge(net);
-	if (err)
-		goto errout;
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	if (((struct fib4_rule *)rule)->tclassid)
 		atomic_dec(&net->ipv4.fib_num_tclassid_users);
 #endif
-	net->ipv4.fib_has_custom_rules = true;
 
 	if (net->ipv4.fib_rules_require_fldissect &&
 	    fib_rule_requires_fldissect(rule))
 		net->ipv4.fib_rules_require_fldissect--;
-errout:
-	return err;
+
+	return 0;
 }
 
 static int fib4_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
-- 
2.54.0.1136.gdb2ca164c4-goog

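A toy model of the teardown ordering, assuming (as the ops_undo_list() -> ops_exit_list() trace suggests) that every ->exit_rtnl() callback runs before any ->exit() callback, each phase walking ops in reverse registration order, and that fib's pernet ops are registered before fou's. All `fake_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_net {
	bool rules_ops_alive;	/* stands in for net->ipv4.rules_ops */
	bool uaf;		/* set if a lookup saw freed rules */
};

struct fake_ops {
	void (*exit_rtnl)(struct fake_net *net);
	void (*exit)(struct fake_net *net);
};

/* stands in for fib4_rules_exit() */
static void rules_free(struct fake_net *net)
{
	net->rules_ops_alive = false;
}

/* fou's ->exit() releases a socket, which can trigger a FIB rules lookup */
static void fou_exit(struct fake_net *net)
{
	if (!net->rules_ops_alive)
		net->uaf = true;
}

/* Phase 1: all ->exit_rtnl(); phase 2: all ->exit(); both in reverse order */
static void undo_ops(struct fake_net *net, const struct fake_ops *ops, size_t n)
{
	for (size_t i = n; i-- > 0;)
		if (ops[i].exit_rtnl)
			ops[i].exit_rtnl(net);
	for (size_t i = n; i-- > 0;)
		if (ops[i].exit)
			ops[i].exit(net);
}
```

With the free in ->exit_rtnl(), fou's ->exit() runs after it and sees freed rules; with the free in ->exit(), reverse registration order runs fou's ->exit() first.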

^ permalink raw reply related

* Re: [Bug] incompatibility between 'e1000e' and Aruba AOS-CX switches (too small inter-packet gap)
From: Andrew Lunn @ 2026-06-16 19:34 UTC (permalink / raw)
  To: Philippe Andersson; +Cc: netdev, Ludovic Calmant, Fabian Noël
In-Reply-To: <457d1617-bd7f-44c5-a9af-7ba8aa9250f4@iba-group.com>

> A support ticket has already been opened with Aruba, but it's unclear at
> this stage that the problem is on their side.

How easy is it to reproduce? Can you run a git bisect from the last
known good kernel version to the first known bad version?

      Andrew

^ permalink raw reply

* Re: [PATCH v27 3/5] cxl/sfc: Initialize dpa without a mailbox
From: Dan Williams (nvidia) @ 2026-06-16 19:35 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams (nvidia),
	alejandro.lucero-palau, linux-cxl, netdev, edward.cree, davem,
	kuba, pabeni, edumazet, dave.jiang
  Cc: Dan Williams, Ben Cheatham, Jonathan Cameron
In-Reply-To: <17b68fb1-768e-49f6-884d-49e0952621b8@amd.com>

Alejandro Lucero Palau wrote:
> 
> On 6/10/26 00:24, Dan Williams (nvidia) wrote:
> > alejandro.lucero-palau@ wrote:
> >> From: Alejandro Lucero <alucerop@amd.com>
> >>
> >> Type3 relies on mailbox CXL_MBOX_OP_IDENTIFY command for initializing
> >> memdev state params which end up being used for DPA initialization.
> >>
> >> Allow a Type2 driver to initialize DPA simply by giving the size of its
> >> volatile hardware partition.
> >>
> >> Move related functions to memdev.
> > The code movement is not strictly necessary. Just add cxl_set_capacity()
> > and we can consider a move later if mbox.o and memdev.o are ever not
> > both included in cxl_core.o by default.
> 
> 
> I think it is the right thing to do as the new function uses add_part() 
> (moved) and the other add_part() client is the other function moved, 
> cxl_mem_dpa_fetch().
> 
> Note cxl_mem_get_partition() used by cxl_mem_dpa_fetch() is the one 
> working with mbox commands and it remains in the same place inside 
> core/mbox.c and the only cxl_mem_dpa_fetch() client is cxl/pci.c
> 
> 
> This was reviewed and accepted so no reason for not doing it ...

Sure, I am ok to let it go as is.

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: Guard conntrack opts error writes
From: Alexei Starovoitov @ 2026-06-16 19:36 UTC (permalink / raw)
  To: Yiyang Chen, bpf, netfilter-devel
  Cc: pablo, fw, phil, davem, edumazet, kuba, pabeni, horms, andrii,
	eddyz87, ast, daniel, memxor, martin.lau, song, yonghong.song,
	jolsa, emil, shuah, kartikey406, coreteam, netdev, linux-kernel,
	linux-kselftest
In-Reply-To: <70aeec0ab762aebe65129cf6052e132c7329edc2.1781586477.git.chenyy23@mails.tsinghua.edu.cn>

On Mon Jun 15, 2026 at 10:42 PM PDT, Yiyang Chen wrote:
> The conntrack lookup and allocation kfuncs take an opts pointer
> together with an opts__sz argument. The verifier checks only the memory
> range described by opts__sz, but the wrappers unconditionally write
> opts->error whenever the internal lookup or allocation helper returns an
> error.
>
> For an invalid size smaller than the end of opts->error, that write can
> land outside the verifier-checked range. Keep returning NULL for invalid
> arguments, but only report the error through opts->error when the
> supplied size includes the field.
>
> This preserves error reporting for the supported 12-byte and 16-byte
> layouts, and for other invalid sizes that still include opts->error.
>
> Fixes: b4c2b9593a1c ("net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF")
> Fixes: d7e79c97c00c ("net: netfilter: Add kfuncs to allocate and insert CT")
> Signed-off-by: Yiyang Chen <chenyy23@mails.tsinghua.edu.cn>
> ---
>  net/netfilter/nf_conntrack_bpf.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/net/netfilter/nf_conntrack_bpf.c b/net/netfilter/nf_conntrack_bpf.c
> index 40c261cd0af38..3c182024ec509 100644
> --- a/net/netfilter/nf_conntrack_bpf.c
> +++ b/net/netfilter/nf_conntrack_bpf.c
> @@ -65,6 +65,11 @@ enum {
>  	NF_BPF_CT_OPTS_SZ = 16,
>  };
>  
> +static bool bpf_ct_opts_has_error(u32 opts_len)
> +{
> +	return opts_len >= offsetofend(struct bpf_ct_opts, error);
> +}
> +
>  static int bpf_nf_ct_tuple_parse(struct bpf_sock_tuple *bpf_tuple,
>  				 u32 tuple_len, u8 protonum, u8 dir,
>  				 struct nf_conntrack_tuple *tuple)
> @@ -298,7 +303,8 @@ bpf_xdp_ct_alloc(struct xdp_md *xdp_ctx, struct bpf_sock_tuple *bpf_tuple,
>  	nfct = __bpf_nf_ct_alloc_entry(dev_net(ctx->rxq->dev), bpf_tuple, tuple__sz,
>  				       opts, opts__sz, 10);
>  	if (IS_ERR(nfct)) {
> -		opts->error = PTR_ERR(nfct);
> +		if (bpf_ct_opts_has_error(opts__sz))
> +			opts->error = PTR_ERR(nfct);

LLMs have no taste.

Above two lines could have been one helper
   bpf_ct_opts_set_error(opts, opts__sz, PTR_ERR(nfct));

Or we can do a step further and simplify the code more.
Turn this:
   if (IS_ERR(nfct)) {
           opts->error = PTR_ERR(nfct);
           return NULL;
   }
   return (struct nf_conn___init *)nfct;
into:
   return (struct nf_conn___init *)bpf_ct_opts_result(opts, opts__sz, nfct);

static void *bpf_ct_opts_result(struct bpf_ct_opts *opts, u32 opts__sz, void *ret)
{
  if (!IS_ERR(ret))
    return ret;
  if (opts__sz >= offsetofend(struct bpf_ct_opts, error))
    opts->error = PTR_ERR(ret);
  return NULL;
}

This kind of small improvements should be obvious to any human developer.
Please do NOT send us patches straight out of LLM.
Review it first and think how to improve it.

pw-bot: cr

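A userspace rendering of the `bpf_ct_opts_result()` helper proposed above, with minimal re-implementations of `offsetofend()`, `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()`. The `struct bpf_ct_opts` layout below is an assumption mirroring the 16-byte layout (so `error` ends at byte 8); check it against the real header before relying on it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Assumed mirror of the kernel's 16-byte layout */
struct bpf_ct_opts {
	int32_t netns_id;
	int32_t error;
	uint8_t l4proto;
	uint8_t dir;
	uint16_t ct_zone_id;
	uint8_t ct_zone_dir;
	uint8_t reserved[3];
};

/* Kernel-style error pointers, reduced to what this sketch needs */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Return ret on success; on error, report it only if opts__sz covers ->error */
static void *bpf_ct_opts_result(struct bpf_ct_opts *opts, uint32_t opts__sz,
				void *ret)
{
	if (!IS_ERR(ret))
		return ret;
	if (opts__sz >= offsetofend(struct bpf_ct_opts, error))
		opts->error = PTR_ERR(ret);
	return NULL;
}
```

Both the 12- and 16-byte layouts still get `opts->error`; a 4-byte `opts__sz` gets NULL with no write past the checked range.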
^ permalink raw reply

* Re: [PATCH net-next v6 1/2] dinghai: add ZTE network driver support
From: Andrew Lunn @ 2026-06-16 19:39 UTC (permalink / raw)
  To: han.junyang
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, horms, linux-kernel,
	netdev, ran.ming, han.chengfei, zhang.yanze
In-Reply-To: <20260616213057452I2KLm3mVgWYl_SUTy_YYS@zte.com.cn>

> +++ b/drivers/net/ethernet/zte/dinghai/en_pf.h
> +static inline void *dh_core_alloc_priv(struct dh_core_dev *dh_dev,
> +				       size_t size)
> +{
> +	void *priv = kzalloc(size, GFP_KERNEL);
> +
> +	if (priv)
> +		dh_dev->priv = priv;
> +	return priv;
> +}
> +
> +static inline void dh_core_free_priv(struct dh_core_dev *dh_dev)
> +{
> +	kfree(dh_dev->priv);
> +}

It is unusual for these to be inline functions in a header. Why is
this?

	Andrew

^ permalink raw reply

* Re: [PATCH net-next v6 2/2] dinghai: add hardware register access and PCI capability scanning
From: Andrew Lunn @ 2026-06-16 19:49 UTC (permalink / raw)
  To: han.junyang
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, horms, linux-kernel,
	netdev, ran.ming, han.chengfei, zhang.yanze
In-Reply-To: <20260616213550502kLzSZF2DiQyd9Dl0Dv0Gz@zte.com.cn>

> +int zxdh_pf_common_cfg_init(struct dh_core_dev *dh_dev)
> +{
> +	struct zxdh_pf_device *pf_dev = dh_dev->priv;
> +	struct pci_dev *pdev = dh_dev->pdev;
> +	int common;
> +
> +	/* check for a common config: if not, use legacy mode (bar 0). */
> +	common = zxdh_pf_pci_find_capability(pdev, ZXDH_PCI_CAP_COMMON_CFG,
> +					     IORESOURCE_IO | IORESOURCE_MEM,
> +					     &pf_dev->modern_bars);
> +	if (common == 0) {
> +		dev_err(dh_dev->device,
> +			"missing capabilities %i, leaving for legacy driver\n",
> +			common);

That looks double odd. Normally you would use !common. Also, you know
common is 0, so why use "%i", when it could be just '0'.

> +int zxdh_pf_notify_cfg_init(struct dh_core_dev *dh_dev)
> +{
> +	struct zxdh_pf_device *pf_dev = dh_dev->priv;
> +	struct pci_dev *pdev = dh_dev->pdev;
> +	u32 notify_length;
> +	u32 notify_offset;
> +	int notify;
> +
> +	/* If common is there, these should be too... */
> +	notify = zxdh_pf_pci_find_capability(pdev, ZXDH_PCI_CAP_NOTIFY_CFG,
> +					     IORESOURCE_IO | IORESOURCE_MEM,
> +					     &pf_dev->modern_bars);
> +	if (notify == 0) {
> +		dev_err(dh_dev->device, "missing capabilities %i\n", notify);
> +		return -EINVAL;
> +	}
> +

Same again.

    Andrew

---
pw-bot: cr

^ permalink raw reply

* Re: [PATCH v27 4/5] sfc: obtain and map cxl range using devm_cxl_probe_mem
From: Dan Williams (nvidia) @ 2026-06-16 19:51 UTC (permalink / raw)
  To: Alejandro Lucero Palau, Dan Williams (nvidia),
	alejandro.lucero-palau, linux-cxl, netdev, edward.cree, davem,
	kuba, pabeni, edumazet, dave.jiang
In-Reply-To: <50d8e423-8248-4e26-901b-010d14d22e67@amd.com>

Alejandro Lucero Palau wrote:
> 
> On 6/10/26 14:56, Alejandro Lucero Palau wrote:
> >
> > On 6/10/26 07:10, Alejandro Lucero Palau wrote:
> >>
> >> On 6/10/26 00:30, Dan Williams (nvidia) wrote:
> >>> alejandro.lucero-palau@ wrote:
> >>>> From: Alejandro Lucero <alucerop@amd.com>
> >>>>
> >>>> Use core API for safely obtain the CXL range linked to an HDM 
> >>>> committed
> >>>> by the BIOS. Map such a range for being used as the ctpio buffer.
> >>>>
> >>>> A potential user space action through sysfs unbinding or core cxl
> >>>> modules remove will trigger sfc driver device detachment, with that 
> >>>> case
> >>>> not racing with this mapping as this is done during driver probe and
> >>>> therefore protected with device lock against those user space actions.
> >>>>
> >>>> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> >>>> ---
> >>>>   drivers/net/ethernet/sfc/efx.c     |  1 +
> >>>>   drivers/net/ethernet/sfc/efx_cxl.c | 24 ++++++++++++++++++++++++
> >>>>   drivers/net/ethernet/sfc/efx_cxl.h |  3 +++
> >>>>   3 files changed, 28 insertions(+)
> >>>>
> >>>> diff --git a/drivers/net/ethernet/sfc/efx.c 
> >>>> b/drivers/net/ethernet/sfc/efx.c
> >>>> index 90ccbe310386..578054c21e79 100644
> >>>> --- a/drivers/net/ethernet/sfc/efx.c
> >>>> +++ b/drivers/net/ethernet/sfc/efx.c
> >>>> @@ -984,6 +984,7 @@ static void efx_pci_remove(struct pci_dev 
> >>>> *pci_dev)
> >>>>       efx_fini_io(efx);
> >>>>         probe_data = container_of(efx, struct efx_probe_data, efx);
> >>>> +    efx_cxl_exit(probe_data);
> >>>>         pci_dbg(efx->pci_dev, "shutdown successful\n");
> >>>>   diff --git a/drivers/net/ethernet/sfc/efx_cxl.c 
> >>>> b/drivers/net/ethernet/sfc/efx_cxl.c
> >>>> index 4d55c08cf2a1..d5766a40e2cf 100644
> >>>> --- a/drivers/net/ethernet/sfc/efx_cxl.c
> >>>> +++ b/drivers/net/ethernet/sfc/efx_cxl.c
> >>>> @@ -18,6 +18,7 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
> >>>>   {
> >>>>       struct efx_nic *efx = &probe_data->efx;
> >>>>       struct pci_dev *pci_dev = efx->pci_dev;
> >>>> +    struct range cxl_pio_range;
> >>>>       struct efx_cxl *cxl;
> >>>>       u16 dvsec;
> >>>>       int rc;
> >>>> @@ -75,9 +76,32 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
> >>>>           return -ENODEV;
> >>>>       }
> >>>>   +    cxl->cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &cxl_pio_range);
> >>>> +    if (IS_ERR(cxl->cxlmd)) {
> >>>> +        pci_err(pci_dev, "CXL accel memdev creation failed\n");
> >>>> +        return PTR_ERR(cxl->cxlmd);
> >>>> +    }
> >>>> +
> >>>> +    cxl->ctpio_cxl = ioremap_wc(cxl_pio_range.start,
> >>>> +                    range_len(&cxl_pio_range));
> >>>> +    if (!cxl->ctpio_cxl) {
> >>>> +        pci_err(pci_dev, "CXL ioremap region (%pra) failed\n",
> >>>> +            &cxl_pio_range);
> >>>> +        return -ENOMEM;
> >>> Dave caught the iounmap leak, but another concern is since you want to
> >>> continue operation if efx_cxl_init() fails then you probably also want
> >>> to release the successful attachment to the CXL domain if this happens.
> >>
> >>
> >> I will do that.
> >>
> >
> > Looking at this issue, I think an error when creating the memdev or 
> > during the region attach triggers the memdev removal, but ...
> >
> >
> >>
> >>> Minor since something else is likely to fail if ioremap is not 
> >>> reliable.
> >
> >
> > .. if we want to specifically do that with an unlikely (but possible) 
> > ioremap error something else needs to be exported like 
> > cxl_memdev_unregister(). Are you happy with that approach?
> >
> 
> I have just tested with this:
> 
> +void cxl_memdev_remove(void *_cxlmd)
> +{
> +       struct cxl_memdev *cxlmd = _cxlmd;
> +       struct device *dev = &cxlmd->dev;
> +
> +       devm_remove_action_nowarn(cxlmd->cxlds->dev, cxl_memdev_unregister,
> +                                 cxlmd);
> +
> +       cdev_device_del(&cxlmd->cdev, dev);
> +       cxl_memdev_shutdown(dev);
> +       put_device(dev);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_memdev_remove, "CXL");
> 
> 
> only called if the ioremap fails.
> 
> 
> Please, let me know if you like this approach before sending another 
> version.

A devres group can automatically clean up after devm_cxl_probe_mem()
in the error path with no new exports needed from the CXL core.
Something like:

        void *group = devres_open_group(cxl->cxlds.dev, NULL, GFP_KERNEL);
        int rc = 0;

        if (!group)
                return -ENOMEM;
        
        cxl->cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &cxl_pio_range);
        if (IS_ERR(cxl->cxlmd)) {
                pci_err(pci_dev, "CXL accel memdev creation failed\n");
                rc = PTR_ERR(cxl->cxlmd);
                goto out;
        }

        cxl->ctpio_cxl =
                ioremap_wc(cxl_pio_range.start, range_len(&cxl_pio_range));
        if (!cxl->ctpio_cxl) {
                pci_err(pci_dev, "CXL ioremap region (%pra) failed\n",
                        &cxl_pio_range);
                rc = -ENOMEM;
        }

out:
        if (rc)
                devres_release_group(group);
        else
                devres_remove_group(group);
        return rc;

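A toy model (not the kernel's devres implementation) of what the group buys here: actions registered after devres_open_group() are unwound, newest first, by devres_release_group(), while devres_remove_group() only drops the group marker and leaves the actions to run at driver detach. All `toy_*` names and `count_release()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACTIONS 16

/* Toy devres list: a stack of cleanup actions */
struct toy_devres {
	void (*fn[MAX_ACTIONS])(void *);
	void *arg[MAX_ACTIONS];
	size_t n;
};

static void toy_add(struct toy_devres *d, void (*fn)(void *), void *arg)
{
	assert(d->n < MAX_ACTIONS);
	d->fn[d->n] = fn;
	d->arg[d->n] = arg;
	d->n++;
}

/* Opening a group just remembers where it starts */
static size_t toy_open_group(const struct toy_devres *d)
{
	return d->n;
}

/* Error path: run every action added since the group opened, newest first */
static void toy_release_group(struct toy_devres *d, size_t group)
{
	while (d->n > group) {
		d->n--;
		d->fn[d->n](d->arg[d->n]);
	}
}

/* Success path: forget the marker; actions stay registered for detach */
static void toy_remove_group(struct toy_devres *d, size_t group)
{
	(void)d;
	(void)group;
}

/* Example action: counts how many times cleanup ran */
static void count_release(void *arg)
{
	(*(int *)arg)++;
}
```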
^ permalink raw reply

* Landlock: LANDLOCK_ACCESS_NET_CONNECT_TCP bypass via TCP Fast Open
From: Bryam Vargas @ 2026-06-16 20:16 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Günther Noack, Matthieu Buffet, Paul Moore, Eric Dumazet,
	Neal Cardwell, linux-security-module, netdev, linux-kernel

Hello Mickaël, and Landlock folks,

A task confined by a Landlock ruleset that handles
LANDLOCK_ACCESS_NET_CONNECT_TCP and is denied connecting to a given port can
still establish a TCP connection to that port by using TCP Fast Open, i.e.
sendto(fd, ..., MSG_FASTOPEN, &dst, dstlen) on a fresh stream socket. The
network-egress confinement for TCP connect is silently bypassed.

Affected
--------
Any kernel with CONFIG_SECURITY_LANDLOCK=y and Landlock enabled that supports
the TCP network access rights (Landlock ABI >= 4, since Linux 6.7). Confirmed by
source inspection on mainline (v7.1-rc7) and reproduced on Linux 7.0.11
(Landlock ABI 8). No config options beyond Landlock and IPv4/IPv6 TCP are
needed. The TCP Fast Open client is enabled by the per-netns default
(net.ipv4.tcp_fastopen has TFO_CLIENT_ENABLE set), so neither a sysctl change
nor a setsockopt() is required.

Root cause
----------
LANDLOCK_ACCESS_NET_CONNECT_TCP is enforced only by the socket_connect LSM hook
(hook_socket_connect -> current_check_access_socket). security_socket_connect()
has exactly one call site in the tree, net/socket.c (the connect(2) syscall).

TCP Fast Open performs an implicit connect inside sendmsg:

  tcp_sendmsg_locked()            net/ipv4/tcp.c  (MSG_FASTOPEN branch)
   -> tcp_sendmsg_fastopen()      net/ipv4/tcp.c
   -> __inet_stream_connect(..., is_sendmsg=1)  net/ipv4/af_inet.c
   -> sk->sk_prot->connect()      net/ipv4/af_inet.c  -> tcp_v4_connect()

This path establishes the connection to the address taken from msg_name but
never calls security_socket_connect(). The only LSM hook fired on the sendmsg
path is security_socket_sendmsg(), and Landlock registers no socket_sendmsg
hook, so LANDLOCK_ACCESS_NET_CONNECT_TCP is never re-checked. __inet_stream_connect()
itself carries no LSM hook (only the cgroup-BPF pre_connect, a different
mechanism).

Notably the kernel already mediates the analogous AF_UNIX implicit-connect on the
send path via the unix_may_send hook, which Landlock does register
(hook_unix_may_send) -- so the sendmsg-implies-connect pattern is recognized, but
the TCP Fast Open case has no equivalent coverage. The MPTCP fast-open path
(mptcp_sendmsg_fastopen -> __inet_stream_connect) is a second producer of the
same unmediated connect (by source inspection; not separately reproduced).

Reproducer
----------
A self-contained, fully unprivileged PoC is available on request. It forks an
unconfined TFO-capable loopback listener, then in a child applies a Landlock
ruleset handling LANDLOCK_ACCESS_NET_CONNECT_TCP with no allow rule
(landlock_create_ruleset() with handled_access_net =
LANDLOCK_ACCESS_NET_CONNECT_TCP, no landlock_add_rule(), then
landlock_restrict_self(); every TCP connect is denied) and tries the forbidden
port two ways:

  (1) connect(fd, &dst)                 -> -EACCES   (Landlock enforces CONNECT_TCP)
  (2) sendto(fd2, buf, len, MSG_FASTOPEN, &dst, dstlen)
                                        -> succeeds; the listener accepts the
                                           connection and reads the payload.

Observed on Linux 7.0.11 (Landlock ABI 8):

  [1] connect(2)            -> ret=-1 errno=13 (Permission denied)
  [2] sendto(MSG_FASTOPEN)  -> ret=14 errno=0 (OK/queued)
  [+] listener ACCEPTED the confined child's connection; payload="..."

connect(2) to the port is denied while sendto(MSG_FASTOPEN) reaches the identical
port and delivers data.

Impact
------
A sandbox that uses LANDLOCK_ACCESS_NET_CONNECT_TCP to restrict outbound TCP
(e.g. to keep a confined component from reaching an internal service or a
metadata endpoint) can be escaped by an unprivileged, self-confined task with no
capabilities and no namespace transition. This works for any destination port,
since the implicit-connect path never consults the connect hook, whatever the
address (the run above exercises a single port). It is an integrity bypass of
the network-confinement property; no memory-safety bug is involved.
I score it CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:N (6.5 Medium) -- the
confined task escapes the policy authority that defined its sandbox, a scope
change; 5.5 if you treat the Landlock boundary as the same authority (S:U).

Note on the in-flight UDP series
--------------------------------
The "landlock: Add UDP access control support" series (v5, Matthieu Buffet,
https://lore.kernel.org/r/20260611162107.49278-3-matthieu@buffet.re) adds a
socket_sendmsg hook, hook_socket_sendmsg(), but it returns 0 for non-UDP
sockets:

    if (sk_is_udp(sock->sk))
            access_request = LANDLOCK_ACCESS_NET_CONNECT_SEND_UDP;
    else
            return 0;

so a TCP socket using MSG_FASTOPEN still bypasses LANDLOCK_ACCESS_NET_CONNECT_TCP
even after that series lands. It may be most convenient to fix this there.

Suggested direction
-------------------
Re-check LANDLOCK_ACCESS_NET_CONNECT_TCP on the implicit-connect path: either have
the socket_sendmsg hook evaluate CONNECT_TCP for stream sockets when the call
performs an implicit connect (mirroring the AF_UNIX unix_may_send handling), or
place the check inside __inet_stream_connect() so a single chokepoint covers
connect(2), TCP Fast Open, and the MPTCP fast-open sibling.

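A policy-level sketch of the first suggested direction, not Landlock code: a sendmsg-time check that treats MSG_FASTOPEN on an unconnected TCP stream socket with a destination address as an implied connect, and applies the CONNECT_TCP decision to it. `struct toy_sock`, `struct toy_policy` and the `toy_*` / `send_implies_connect()` functions are hypothetical; a real hook would also need to cover the MPTCP fast-open path.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000
#endif

/* Hypothetical view of a socket, for policy modelling only */
struct toy_sock {
	int type;	/* SOCK_STREAM, SOCK_DGRAM, ... */
	bool is_tcp;
	bool connected;
};

/* Toy policy: the only TCP port the sandbox may connect to */
struct toy_policy {
	unsigned short allowed_port;
};

static int toy_check_connect_tcp(const struct toy_policy *p, unsigned short port)
{
	return port == p->allowed_port ? 0 : -EACCES;
}

/* True when this send would perform an implicit TCP connect */
static bool send_implies_connect(const struct toy_sock *s, int flags,
				 bool has_dest)
{
	return s->type == SOCK_STREAM && s->is_tcp && !s->connected &&
	       (flags & MSG_FASTOPEN) && has_dest;
}

/* Re-apply the connect decision on the sendmsg path */
static int toy_hook_sendmsg(const struct toy_policy *p, const struct toy_sock *s,
			    int flags, bool has_dest, unsigned short port)
{
	if (send_implies_connect(s, flags, has_dest))
		return toy_check_connect_tcp(p, port);
	return 0;
}
```

In this model a denied port gets -EACCES from sendto(MSG_FASTOPEN) exactly as it would from connect(2), while ordinary sends are left alone.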
I am happy to send a patch for this if you would like me to.

Best regards,

Bryam Vargas
Independent security researcher, HEXLAB S.A.S., Cali, Colombia
hexlabsecurity@proton.me


^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: convert udp_lib_getsockopt to sockopt_t
From: David Laight @ 2026-06-16 20:16 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Stanislav Fomichev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Willem de Bruijn, Shuah Khan, netdev,
	linux-kernel, linux-kselftest, kernel-team
In-Reply-To: <ajF4Odi_L28LdIXC@gmail.com>

On Tue, 16 Jun 2026 09:22:52 -0700
Breno Leitao <leitao@debian.org> wrote:

> On Fri, Jun 12, 2026 at 07:10:15PM -0700, Stanislav Fomichev wrote:
> > On 06/12, Breno Leitao wrote:  
> 
> > >  int udp_lib_getsockopt(struct sock *sk, int level, int optname,
> > > -		       char __user *optval, int __user *optlen)
> > > +		       sockopt_t *opt)
> > >  {
> > >  	struct udp_sock *up = udp_sk(sk);
> > >  	int val, len;
> > >  
> > > -	if (get_user(len, optlen))
> > > -		return -EFAULT;  
> > 
> > [..]
> >   
> > > -	if (len < 0)
> > > -		return -EINVAL;  
> > 
> > I see this part now in sockopt_init_user, but you mention that it's a
> > transitional helper. When we drop it, will we lose this <0 check?
> > Maybe keep `if ((int)opt->optlen < 0)` here for backwards
> > compatibility?  
> 
> Good idea. I will do it and respin (once net-next reopens).

The best place for the negative length check is in the syscall wrapper code.
Pass an unsigned length through to all the protocol code.
No need to require every function to do the test.

Note that the length check was actually broken in many protocols
going way back well before git.
There has pretty much always been an unsigned min() check that converted
negative values to small(ish) positive ones before the check for it being
negative.
(That predates min() being a #define.)

The recent change to actually return an error for optlen < 0 might have broken
some applications that passed uninitialised stack memory that was always negative!

-- David

> 
> Thanks for the review,
> --breno
> 

* Re: [PATCH] ice: retry reading NVM if admin queue returns EBUSY
From: kernel test robot @ 2026-06-16 20:18 UTC (permalink / raw)
  To: Robert Malz, anthony.l.nguyen, przemyslaw.kitszel
  Cc: oe-kbuild-all, intel-wired-lan, netdev
In-Reply-To: <20260616104521.1545053-1-robert.malz@canonical.com>

Hi Robert,

kernel test robot noticed the following build errors:

[auto build test ERROR on tnguy-next-queue/dev-queue]
[also build test ERROR on tnguy-net-queue/dev-queue linus/master v7.1 next-20260616]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Robert-Malz/ice-retry-reading-NVM-if-admin-queue-returns-EBUSY/20260616-185349
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
patch link:    https://lore.kernel.org/r/20260616104521.1545053-1-robert.malz%40canonical.com
patch subject: [PATCH] ice: retry reading NVM if admin queue returns EBUSY
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20260616/202606162237.EIrFZKip-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260616/202606162237.EIrFZKip-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606162237.EIrFZKip-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/net/ethernet/intel/ice/ice_nvm.c: In function 'ice_read_flat_nvm':
>> drivers/net/ethernet/intel/ice/ice_nvm.c:101:58: error: 'ICE_AQ_RC_EBUSY' undeclared (first use in this function); did you mean 'LIBIE_AQ_RC_EBUSY'?
     101 |                         if (hw->adminq.sq_last_status != ICE_AQ_RC_EBUSY ||
         |                                                          ^~~~~~~~~~~~~~~
         |                                                          LIBIE_AQ_RC_EBUSY
   drivers/net/ethernet/intel/ice/ice_nvm.c:101:58: note: each undeclared identifier is reported only once for each function it appears in


vim +101 drivers/net/ethernet/intel/ice/ice_nvm.c

    48	
    49	/**
    50	 * ice_read_flat_nvm - Read portion of NVM by flat offset
    51	 * @hw: pointer to the HW struct
    52	 * @offset: offset from beginning of NVM
    53	 * @length: (in) number of bytes to read; (out) number of bytes actually read
    54	 * @data: buffer to return data in (sized to fit the specified length)
    55	 * @read_shadow_ram: if true, read from shadow RAM instead of NVM
    56	 *
    57	 * Reads a portion of the NVM, as a flat memory space. This function correctly
    58	 * breaks read requests across Shadow RAM sectors and ensures that no single
    59	 * read request exceeds the maximum 4KB read for a single AdminQ command.
    60	 *
    61	 * Returns a status code on failure. Note that the data pointer may be
    62	 * partially updated if some reads succeed before a failure.
    63	 */
    64	int
    65	ice_read_flat_nvm(struct ice_hw *hw, u32 offset, u32 *length, u8 *data,
    66			  bool read_shadow_ram)
    67	{
    68		u32 inlen = *length;
    69		u32 bytes_read = 0;
    70		int retry_cnt = 0;
    71		bool last_cmd;
    72		int status;
    73	
    74		*length = 0;
    75	
    76		/* Verify the length of the read if this is for the Shadow RAM */
    77		if (read_shadow_ram && ((offset + inlen) > (hw->flash.sr_words * 2u))) {
    78			ice_debug(hw, ICE_DBG_NVM, "NVM error: requested offset is beyond Shadow RAM limit\n");
    79			return -EINVAL;
    80		}
    81	
    82		do {
    83			u32 read_size, sector_offset;
    84	
    85			/* ice_aq_read_nvm cannot read more than 4KB at a time.
    86			 * Additionally, a read from the Shadow RAM may not cross over
    87			 * a sector boundary. Conveniently, the sector size is also
    88			 * 4KB.
    89			 */
    90			sector_offset = offset % ICE_AQ_MAX_BUF_LEN;
    91			read_size = min_t(u32, ICE_AQ_MAX_BUF_LEN - sector_offset,
    92					  inlen - bytes_read);
    93	
    94			last_cmd = !(bytes_read + read_size < inlen);
    95	
    96			status = ice_aq_read_nvm(hw, ICE_AQC_NVM_START_POINT,
    97						 offset, read_size,
    98						 data + bytes_read, last_cmd,
    99						 read_shadow_ram, NULL);
   100			if (status) {
 > 101				if (hw->adminq.sq_last_status != ICE_AQ_RC_EBUSY ||
   102				    retry_cnt > ICE_SQ_SEND_MAX_EXECUTE)
   103					break;
   104				ice_debug(hw, ICE_DBG_NVM,
   105					  "NVM read EBUSY error, retry %d\n",
   106					  retry_cnt + 1);
   107				last_cmd = false;
   108				ice_release_nvm(hw);
   109				msleep(ICE_SQ_SEND_DELAY_TIME_MS);
   110				status = ice_acquire_nvm(hw, ICE_RES_READ);
   111				if (status)
   112					break;
   113				retry_cnt++;
   114			} else {
   115				bytes_read += read_size;
   116				offset += read_size;
   117				retry_cnt = 0;
   118			}
   119		} while (!last_cmd);
   120	
   121		*length = bytes_read;
   122		return status;
   123	}
   124	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* [PATCH] net: faraday: ftmac100: convert to devm resource management
From: Jack Lee @ 2026-06-16 20:32 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew+netdev, edumazet, pabeni, netdev, linux-kernel, Jack Lee

Replace manual resource management with device-managed alternatives:
- alloc_etherdev() -> devm_alloc_etherdev()
- request_mem_region() + ioremap() -> devm_platform_ioremap_resource()

This simplifies error handling by removing manual cleanup in error
paths and the remove function, and eliminates the risk of resource
leaks.

Signed-off-by: Jack Lee <skunkolee@gmail.com>
---
 drivers/net/ethernet/faraday/ftmac100.c | 47 +++++--------------------
 1 file changed, 9 insertions(+), 38 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftmac100.c b/drivers/net/ethernet/faraday/ftmac100.c
index 5803a382f0ba..adb318925f44 100644
--- a/drivers/net/ethernet/faraday/ftmac100.c
+++ b/drivers/net/ethernet/faraday/ftmac100.c
@@ -49,7 +49,6 @@ struct ftmac100_descs {
 };
 
 struct ftmac100 {
-	struct resource *res;
 	void __iomem *base;
 	int irq;
 
@@ -1137,11 +1136,9 @@ static int ftmac100_probe(struct platform_device *pdev)
 		return irq;
 
 	/* setup net_device */
-	netdev = alloc_etherdev(sizeof(*priv));
-	if (!netdev) {
-		err = -ENOMEM;
-		goto err_alloc_etherdev;
-	}
+	netdev = devm_alloc_etherdev(&pdev->dev, sizeof(*priv));
+	if (!netdev)
+		return -ENOMEM;
 
 	SET_NETDEV_DEV(netdev, &pdev->dev);
 	netdev->ethtool_ops = &ftmac100_ethtool_ops;
@@ -1150,7 +1147,7 @@ static int ftmac100_probe(struct platform_device *pdev)
 
 	err = platform_get_ethdev_address(&pdev->dev, netdev);
 	if (err == -EPROBE_DEFER)
-		goto defer_get_mac;
+		return err;
 
 	platform_set_drvdata(pdev, netdev);
 
@@ -1165,20 +1162,9 @@ static int ftmac100_probe(struct platform_device *pdev)
 	netif_napi_add(netdev, &priv->napi, ftmac100_poll);
 
 	/* map io memory */
-	priv->res = request_mem_region(res->start, resource_size(res),
-				       dev_name(&pdev->dev));
-	if (!priv->res) {
-		dev_err(&pdev->dev, "Could not reserve memory region\n");
-		err = -ENOMEM;
-		goto err_req_mem;
-	}
-
-	priv->base = ioremap(res->start, resource_size(res));
-	if (!priv->base) {
-		dev_err(&pdev->dev, "Failed to ioremap ethernet registers\n");
-		err = -EIO;
-		goto err_ioremap;
-	}
+	priv->base = devm_platform_ioremap_resource(pdev, 0);
+	if (IS_ERR(priv->base))
+		return PTR_ERR(priv->base);
 
 	priv->irq = irq;
 
@@ -1208,32 +1194,17 @@ static int ftmac100_probe(struct platform_device *pdev)
 	return 0;
 
 err_register_netdev:
-	iounmap(priv->base);
-err_ioremap:
-	release_resource(priv->res);
-err_req_mem:
 	netif_napi_del(&priv->napi);
-defer_get_mac:
-	free_netdev(netdev);
-err_alloc_etherdev:
 	return err;
 }
 
 static void ftmac100_remove(struct platform_device *pdev)
 {
-	struct net_device *netdev;
-	struct ftmac100 *priv;
-
-	netdev = platform_get_drvdata(pdev);
-	priv = netdev_priv(netdev);
+	struct net_device *netdev = platform_get_drvdata(pdev);
+	struct ftmac100 *priv = netdev_priv(netdev);
 
 	unregister_netdev(netdev);
-
-	iounmap(priv->base);
-	release_resource(priv->res);
-
 	netif_napi_del(&priv->napi);
-	free_netdev(netdev);
 }
 
 static const struct of_device_id ftmac100_of_ids[] = {
-- 
2.54.0


* Re: [PATCH net v2 1/2] iov_iter: export iov_iter_restore
From: Jens Axboe @ 2026-06-16 20:47 UTC (permalink / raw)
  To: Octavian Purdila, netdev
  Cc: Alexander Viro, Andrew Morton, Arseniy Krasnov, David S. Miller,
	Eric Dumazet, Eugenio Pérez, Jakub Kicinski, Jason Wang, kvm,
	linux-block, linux-fsdevel, linux-kernel, Michael S. Tsirkin,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Stefano Garzarella,
	virtualization, Xuan Zhuo
In-Reply-To: <20260613000953.467473-2-tavip@google.com>

On 6/12/26 6:09 PM, Octavian Purdila wrote:
> Export iov_iter_restore so that it can be used by modules.
> 
> This is needed by the virtio vsock transport (which can be built as a
> module) to restore the msg_iter state when transmission fails.
> 
> Signed-off-by: Octavian Purdila <tavip@google.com>
> ---
>  lib/iov_iter.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 243662af1af73..067e745f9ef53 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1491,6 +1491,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
>  		i->__iov -= state->nr_segs - i->nr_segs;
>  	i->nr_segs = state->nr_segs;
>  }
> +EXPORT_SYMBOL(iov_iter_restore);

I don't have a problem exporting this to modules, but any new export
should be _GPL. So please change it to that.

-- 
Jens Axboe

* Re: [PATCH v1 bpf-next 0/2] bpf: bpf_redirect_peer egress redirection
From: Jordan Rife @ 2026-06-16 20:49 UTC (permalink / raw)
  To: Paul Chaignon
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Stanislav Fomichev
In-Reply-To: <ajAXF8Nvg91xU4f2@mail.gmail.com>

> IMO, calling it BPF_F_EGRESS would be less confusing. It's a shame we
> can't have the same flag API between bpf_redirect() and
> bpf_redirect_peer(), but this is creating inconsistent semantics for
> the terms egress/ingress across the two helpers.

Yeah, one annoying thing about BPF_F_EGRESS is that it would only
apply to bpf_redirect_peer, so you still have inconsistencies across
helpers. Perhaps this is less weird than having BPF_F_INGRESS perform
an egress redirection though.

Jordan

* Re: [PATCH net] octeontx2-af: Validate NIX maximum LFs correctly
From: Subbaraya Sundeep @ 2026-06-16 20:51 UTC (permalink / raw)
  To: Simon Horman
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, sgoutham, gakula,
	bbhushan2, netdev, linux-kernel
In-Reply-To: <20260603170759.GG3766816@horms.kernel.org>

Hi Simon,

On 2026-06-03 at 22:37:59, Simon Horman (horms@kernel.org) wrote:
> On Wed, Jun 03, 2026 at 05:59:46PM +0100, Simon Horman wrote:
> > From: 'Simon Horman' <horms@kernel.org>
> > 
> > This is an AI-generated review of your patch. The human sending this
> > email has considered the AI review valid, or at least plausible.
> > Full review at: https://sashiko.dev
> 
> Sorry, my tooling isn't helping me much today.
> There was supposed to be some commentary in my previous email.
> Let me add it.
> 
> > ---
> > octeontx2-af: Validate NIX maximum LFs correctly
> > 
> > This patch changes the validation condition for setting the maximum
> > number of NIX LFs via devlink. It replaces the check for assigned MCAM
> > entries with a check for assigned NIX LFs to determine if resizing
> > is allowed.
> > 
> > > diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
> > > index 6494a9ee2f0dd..8ba4b5ba9d2c7 100644
> > > --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
> > > +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
> > [ ... ]
> > > @@ -1521,11 +1523,18 @@ static int rvu_af_dl_nix_maxlf_validate(struct devlink *devlink, u32 id,
> > >  	/* Do not allow user to modify maximum NIX LFs while mcam entries
> > >  	 * have already been assigned.
> > >  	 */
> > > -	mcam = &rvu->hw->mcam;
> > > -	if (mcam->bmap_fcnt < mcam->bmap_entries) {
> > > -		NL_SET_ERR_MSG_MOD(extack,
> > > -				   "mcam entries have already been assigned, can't resize");
> > > -		return -EPERM;
> > 
> 
> The issue below looks like it should be looked at in the context of this
> patch.
> 
> > [Severity: High]
> > Does removing the check for MCAM entries allow the devlink parameter
> > set operation to silently destroy active MCAM tracking state?
> > 
> > The .set callback unconditionally calls npc_mcam_rsrcs_deinit() and
> > npc_mcam_rsrcs_init(), which frees and reallocates the mcam->bmap. Since
> > MCAM entries can be allocated independently of NIX LFs (e.g. during
> > rvu_switch_enable() or default switchdev/PTP rules), could this cause
> > resource leaks and use-after-free bugs when untracked rules are later
> > accessed?
>
Some background:
 Say NIX hardware supports 128 LFs. Then:
 1. 128 MCAM entries are always reserved by AF to receive unicast traffic for each LF.
 2. PFs/VFs request an LF attachment during their probe via an mbox message to AF.
 3. The devlink command to set NIX LFs was added later at the request of customers/users,
    so that if their system uses only 2 PFs, they can use the remaining 126 MCAM
    entries for additional traffic steering.
 4. The only caveat is that the user has to load the PF and VF drivers as modules
    after setting NIX LFs via AF devlink as the first command.
> > 
> > > +	blkaddr = rvu_get_next_nix_blkaddr(rvu, blkaddr);
> > > +	while (blkaddr) {
> > > +		block = &rvu->hw->block[blkaddr];
> 
> While the one below seems like it can be assessed in the context
> of possible follow-up.
> 
> > [Severity: High]
> > This is a pre-existing issue, but does this locklessly read the resource
> > states?
> > 
> > It appears that the devlink validate and set callbacks execute without
> > holding the driver's resource locks (rvu->rsrc_lock or mcam->lock).
> > Could a concurrent mailbox request allocate an LF immediately after this
> > lockless validation check succeeds, leading to a race where the .set
> > callback frees structures that are actively being used?
> > 
Given the background above, this is not possible.
> > > +
> > > +		free_lfs = rvu_rsrc_free_count(&block->lf);
> > > +		if (free_lfs != block->lf.max) {
> > > +			NL_SET_ERR_MSG_MOD(extack,
> > > +					   "mcam entries have already been assigned, can't resize");
> > 
> 
> The one below seems less important. But perhaps it would be good
> to address if there is a v2 of this patch.
> 
> > [Severity: Medium]
> > Since the validation logic was updated to evaluate NIX LF allocations,
> > should this error message and the comment above be updated as well? The
> > message still states that "mcam entries have already been assigned", which
> > might confuse users since it's actually evaluating NIX LFs.
> > 
Sure, will modify the message.

Thanks,
Sundeep

> > > +			return -EPERM;
> > > +		}
> > > +
> > > +		blkaddr = rvu_get_next_nix_blkaddr(rvu, blkaddr);
> > >  	}

* Re: [PATCH net-next v2 2/4] udmabuf: emit one sg entry per pinned folio
From: Bobby Eshleman @ 2026-06-16 20:57 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: Donald Hunter, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Andrew Lunn, Gerd Hoffmann,
	Sumit Semwal, Christian König, Shuah Khan, Jason Gunthorpe,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	dri-devel@lists.freedesktop.org, linux-media@vger.kernel.org,
	linaro-mm-sig@lists.linaro.org, linux-kselftest@vger.kernel.org,
	sdf@fomichev.me, razor@blackwall.org, daniel@iogearbox.net,
	almasrymina@google.com, matttbe@kernel.org, skhawaja@google.com,
	dw@davidwei.uk, Bobby Eshleman
In-Reply-To: <IA0PR11MB71852246277F773AC41DAAA3F8E52@IA0PR11MB7185.namprd11.prod.outlook.com>

On Tue, Jun 16, 2026 at 06:04:03AM +0000, Kasireddy, Vivek wrote:
> Adding Jason to this discussion.
> 
> Hi Bobby,
> 
> > Subject: [PATCH net-next v2 2/4] udmabuf: emit one sg entry per pinned
> > folio
> > 
> > From: Bobby Eshleman <bobbyeshleman@meta.com>
> > 
> > get_sg_table() emitted one PAGE_SIZE sg entry per page even when the
> > underlying folio was larger.
> > 
> > Instead, walk folios[] and emit one sg entry per folio. When folios
> We have recently merged a patch (that will make it into 7.2) from Jason that
> replaced sg_set_folio() with sg_alloc_table_from_pages() in udmabuf driver:
> https://gitlab.freedesktop.org/drm/tip/-/commit/5bf888673e0dda5a53220fa0c4956271a46c353c
> 
> Since you are relying on sg_set_folio(), the core argument against its usage
> in udmabuf is that it doesn't work well with offsets > PAGE_SIZE, resulting
> in a malformed scatterlist. Not sure if this can be fixed easily.
> 
> > represent large pages (as is for MFD_HUGETLB), each sg entry is a large
> > page. Normal PAGE_SIZE sg tables are unchanged.
> > 
> > This is helpful for importers like net/core/devmem that expect dmabuf sg
> IMO, udmabuf needs to detect whether importers can handle segments that
> are > PAGE_SIZE and set the entries appropriately. Please look into how the
> GPU drivers and other dmabuf exporters/importers handle this situation, so
> that we can adopt best practices to address this issue.
> 
> Thanks,
> Vivek

Hey Vivek,

It sounds like that patch might solve my problem. I'll apply it and
troubleshoot from there.

Thanks!

Best,
Bobby

> 
> > entries to be size and length aligned. Prior to this patch udmabuf
> > handed over one PAGE_SIZE sg entry per page, so devmem only saw
> > PAGE_SIZE chunks regardless of the underlying folio size.
> > 
> > dma_map_sgtable() does not always merge contiguous pages for us, so we
> > do this internally before exporting.
> > 
> > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> > ---
> >  drivers/dma-buf/udmabuf.c | 52 ++++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 47 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> > index 94b8ecb892bb..9b751dd98b12 100644
> > --- a/drivers/dma-buf/udmabuf.c
> > +++ b/drivers/dma-buf/udmabuf.c
> > @@ -141,26 +141,68 @@ static void vunmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
> >  	vm_unmap_ram(map->vaddr, ubuf->pagecount);
> >  }
> > 
> > +/* Return the number of contiguous pages backed by the folio at @i.
> > + * A udmabuf may map only part of a folio, or reference the same folio
> > + * in multiple non-contiguous runs, so folio_nr_pages() can't be used.
> > + */
> > +static pgoff_t udmabuf_folio_nr_pages(struct udmabuf *ubuf, pgoff_t i)
> > +{
> > +	struct folio *f = ubuf->folios[i];
> > +	pgoff_t j;
> > +
> > +	for (j = 1; i + j < ubuf->pagecount; j++) {
> > +		if (ubuf->folios[i + j] != f)
> > +			break;
> > +		/* Same folio, but not a sequential offset within it. */
> > +		if (ubuf->offsets[i + j] != ubuf->offsets[i] + j * PAGE_SIZE)
> > +			break;
> > +	}
> > +	return j;
> > +}
> > +
> > +/* Count the contiguous folio runs in @ubuf, one sg entry per run.
> > + *
> > + * Coalescing folios into a single sg entry up front lets importers actually
> > + * see large chunks. We can't rely on dma_map_sgtable() to do this for us as
> > + * the dma_map_direct() path preserves the input scatterlist lengths verbatim.
> > + */
> > +static unsigned int udmabuf_sg_nents(struct udmabuf *ubuf)
> > +{
> > +	unsigned int nents = 0;
> > +	pgoff_t i;
> > +
> > +	for (i = 0; i < ubuf->pagecount; i += udmabuf_folio_nr_pages(ubuf, i))
> > +		nents++;
> > +	return nents;
> > +}
> > +
> >  static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
> >  				     enum dma_data_direction direction)
> >  {
> >  	struct udmabuf *ubuf = buf->priv;
> > -	struct sg_table *sg;
> >  	struct scatterlist *sgl;
> > -	unsigned int i = 0;
> > +	struct sg_table *sg;
> > +	pgoff_t i, run;
> > +	unsigned int nents;
> >  	int ret;
> > 
> > +	nents = udmabuf_sg_nents(ubuf);
> > +
> >  	sg = kzalloc_obj(*sg);
> >  	if (!sg)
> >  		return ERR_PTR(-ENOMEM);
> > 
> > -	ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL);
> > +	ret = sg_alloc_table(sg, nents, GFP_KERNEL);
> >  	if (ret < 0)
> >  		goto err_alloc;
> > 
> > -	for_each_sg(sg->sgl, sgl, ubuf->pagecount, i)
> > -		sg_set_folio(sgl, ubuf->folios[i], PAGE_SIZE,
> > +	sgl = sg->sgl;
> > +	for (i = 0; i < ubuf->pagecount; i += run) {
> > +		run = udmabuf_folio_nr_pages(ubuf, i);
> > +		sg_set_folio(sgl, ubuf->folios[i], run << PAGE_SHIFT,
> >  			     ubuf->offsets[i]);
> > +		sgl = sg_next(sgl);
> > +	}
> > 
> >  	ret = dma_map_sgtable(dev, sg, direction, 0);
> >  	if (ret < 0)
> > 
> > --
> > 2.53.0-Meta
> 

* Re: [syzbot] [net?] WARNING in tls_err_abort
From: Sabrina Dubroca @ 2026-06-16 21:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: syzbot, davem, edumazet, horms, john.fastabend, linux-kernel,
	netdev, pabeni, syzkaller-bugs
In-Reply-To: <20260616082816.4dd0f035@kernel.org>

2026-06-16, 08:28:16 -0700, Jakub Kicinski wrote:
> On Tue, 16 Jun 2026 17:19:22 +0200 Sabrina Dubroca wrote:
> > I suspect err==0, and sock_error() consumed sk_err in between (the
> > alternative would be err > 0).
> > 
> > Something like this?
> 
> Makes sense, but what's eating sk_err?

The 2 remaining sock_error() in tls_rx_rec_wait()? [1]

> Don't we depend on it being set
> to avoid further state transitions once we hit a crypto error?

I kind of thought so too.

> I thought that's why we don't consume sk_err in recvmsg and sendmsg in
> the first place (we are not calling sock_error() anywhere)

Umm...
[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/tls/tls_sw.c#n1095

-- 
Sabrina

* [PATCH] e1000: Remove redundant else after return
From: Lovekesh Solanki @ 2026-06-16 21:00 UTC (permalink / raw)
  To: anthony.l.nguyen
  Cc: przemyslaw.kitszel, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, Lovekesh Solanki

The else branch is unnecessary because the preceding branch
unconditionally returns -ENOMEM.

Reduce nesting by removing the redundant else.

Signed-off-by: Lovekesh Solanki <lovekeshsolanki00@gmail.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 9b09eb144b81..3d97e952c916 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -1546,11 +1546,10 @@ static int e1000_setup_tx_resources(struct e1000_adapter *adapter,
 			      "for the transmit descriptor ring\n");
 			vfree(txdr->buffer_info);
 			return -ENOMEM;
-		} else {
+		}
 			/* Free old allocation, new allocation was successful */
 			dma_free_coherent(&pdev->dev, txdr->size, olddesc,
 					  olddma);
-		}
 	}
 	memset(txdr->desc, 0, txdr->size);
 
-- 
2.54.0


* Re: [PATCH net-next V3 2/7] netdevsim: Register devlink after device init
From: Jakub Kicinski @ 2026-06-16 21:05 UTC (permalink / raw)
  To: Mark Bloch
  Cc: Jiri Pirko, Eric Dumazet, Paolo Abeni, Andrew Lunn,
	David S. Miller, Jonathan Corbet, Shuah Khan, Simon Horman,
	Sunil Goutham, Linu Cherian, Geetha sowjanya, hariprasad,
	Subbaraya Sundeep, Bharat Bhushan, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Ethan Nelson-Moore, linux-doc,
	netdev, linux-rdma
In-Reply-To: <7635d50c-1c82-4090-8907-53a72444fc04@nvidia.com>

On Tue, 16 Jun 2026 20:29:25 +0300 Mark Bloch wrote:
> I think the explicit helper is the cleanest option here, without any
> workqueue fallback inside devlink. It avoids depending on devl_register()
> ordering, and makes the support explicit per driver.
> 
> Does that sound like an acceptable direction?

I'd much rather have the workqueue with the purely theoretical race
with user space than a bunch of drivers that don't act on the cmdline
params.

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Jakub Kicinski @ 2026-06-16 21:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, Petr Mladek, John Ogness,
	Sergey Senozhatsky, Vlad Poenaru, Thomas Gleixner, netdev,
	David S . Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Breno Leitao, Clark Williams, Steven Rostedt, linux-rt-devel,
	linux-kernel, stable, Frederic Weisbecker, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260616170257.GH49951@noisy.programming.kicks-ass.net>

On Tue, 16 Jun 2026 19:02:57 +0200 Peter Zijlstra wrote:
> > So this is not an issue since commit 7eab73b18630e ("netconsole: convert
> > to NBCON console infrastructure"). From that commit on, writes are
> > deferred to the nbcon thread. So this is purely about -stable in this case.
> 
> Hmm, I thought netconsole had some reserved skbs and could do writes
> 'atomic'-like? That said, it was 2.6 era the last time I looked at
> netconsole.

Yes, that part is fine. The problem is that netconsole tries
to reap Tx completions if the Tx queue is full. We can't call
skb destructor in irq context so we put the completed skbs on
a queue and try to arm softirq to get to them later.
Arming softirq causes a ksoftirq wake up.

We already skip the completion polling if we detect getting called
from the same networking driver. It's best effort, anyway.
Networking-side fix would be to toss another OR condition into
the skip. But we don't have one that'd work cleanly :S

* Re: [PATCH] net: faraday: ftmac100: convert to devm resource management
From: Jakub Kicinski @ 2026-06-16 21:21 UTC (permalink / raw)
  To: Jack Lee; +Cc: davem, andrew+netdev, edumazet, pabeni, netdev, linux-kernel
In-Reply-To: <20260616203233.55234-1-skunkolee@gmail.com>

On Tue, 16 Jun 2026 13:32:33 -0700 Jack Lee wrote:
> Replace manual resource management with device-managed alternatives:
> - alloc_etherdev() -> devm_alloc_etherdev()
> - request_mem_region() + ioremap() -> devm_platform_ioremap_resource()
> 
> This simplifies error handling by removing manual cleanup in error
> paths and the remove function, and eliminates the risk of resource
> leaks.

net-next is closed right now. Also:

Quoting documentation:

  Clean-up patches
  ~~~~~~~~~~~~~~~~
  
  Netdev discourages patches which perform simple clean-ups, which are not in
  the context of other work. For example:
  
  * Addressing ``checkpatch.pl``, and other trivial coding style warnings
  * Addressing :ref:`Local variable ordering<rcs>` issues
  * Conversions to device-managed APIs (``devm_`` helpers)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  

  This is because it is felt that the churn that such changes produce comes
  at a greater cost than the value of such clean-ups.
  
  Conversely, spelling and grammar fixes are not discouraged.
  
See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#clean-up-patches
-- 
pw-bot: reject

* Re: [syzbot] [net?] WARNING in tls_err_abort
From: Jakub Kicinski @ 2026-06-16 21:23 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: syzbot, davem, edumazet, horms, john.fastabend, linux-kernel,
	netdev, pabeni, syzkaller-bugs
In-Reply-To: <ajG5hg9oJvyxPplG@krikkit>

On Tue, 16 Jun 2026 23:00:54 +0200 Sabrina Dubroca wrote:
> 2026-06-16, 08:28:16 -0700, Jakub Kicinski wrote:
> > On Tue, 16 Jun 2026 17:19:22 +0200 Sabrina Dubroca wrote:  
> > > I suspect err==0, and sock_error() consumed sk_err in between (the
> > > alternative would be err > 0).
> > > 
> > > Something like this?  
> > 
> > Makes sense, but what's eating sk_err?  
> 
> The 2 remaining sock_error() in tls_rx_rec_wait()? [1]

How did that elude my grep..

> > Don't we depend on it being set
> > to avoid further state transitions once we hit a crypto error?  
> 
> I kind of thought so too.

In which case the question is whether we should try to remove 
the sock_error() instead? (stating the obvious I guess)

> > I thought that's why we don't consume sk_err in recvmsg and sendmsg in
> > the first place (we are not calling sock_error() anywhere)  
> 
> Umm...
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/tls/tls_sw.c#n1095
> 


* Re: [PATCH net-next v2] net: dsa: Fix skb ownership in taggers
From: Vladimir Oltean @ 2026-06-16 21:37 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Florian Fainelli, Jonas Gorski,
	Hauke Mehrtens, Kurt Kanzenbach, Woojung Huh, UNGLinuxDriver,
	Chester A. Unal, Daniel Golle, Matthias Brugger,
	AngeloGioacchino Del Regno, Wei Fang, Clark Wang,
	Clément Léger, George McCollister, David Yang, netdev,
	Sashiko AI Review
In-Reply-To: <20260616-dsa-fix-free-skb-v2-1-9dbda6a19e97@kernel.org>

On Tue, Jun 16, 2026 at 11:36:22AM +0200, Linus Walleij wrote:
> The tag_8021q.c tagger calls vlan_insert_tag() in dsa_8021q_xmit().
> vlan_insert_tag() will consume the skb with kfree_skb() on failure
> and return NULL.
> 
> When NULL is returned as error code to ->xmit() in dsa_user_xmit()
> it will free the same skb again leading to a double-free.
> 
> The idea of dsa_user_xmit() and dsa_switch_rcv() dropping the skb
> they held before the call to ->xmit() and ->rcv() is conceptually
> wrong: the pattern elsewhere in the networking code is that consumers
> drop their skb:s on failure.
> 
> Modify the ->xmit() and ->rcv() call sites to not drop the SKB if
> the taggers return NULL from any of these calls. Move those drops into
> the taggers so every callback error path that retains ownership consumes
> the skb before returning NULL.
> 
> Keep the existing helper ownership rules: VLAN insertion helpers already
> free on failure (this is the case in tag_8021q.c), while deferred
> transmit paths either transfer the skb reference to worker context or
> hold a worker reference with skb_get() and drop the caller's reference.
> 
> For SJA1105 meta RX, transfer the buffered stampable skb under the meta
> lock and return NULL while the skb is waiting for its meta frame: the
> skb is not dropped in this case.
> 
> Reported-by: Sashiko AI Review <sashiko-bot@kernel.org>
> Closes: https://lore.kernel.org/r/20260610153952.1685895-1-kuba@kernel.org/
> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> Assisted-by: Codex:gpt-5-5
> Acked-by: David Yang <mmyangfl@gmail.com> # yt921x
> Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
> Signed-off-by: Linus Walleij <linusw@kernel.org>
> ---
> Changes in v2:
> - In some instances __skb_pad() and __skb_put_padto() followed by a
>   kfree_skb() could be simplified to just call skb_pad() and
>   skb_put_padto() which will free the skb on failure.
> - Use a label and goto for the kfree_skb(); return NULL; in
>   the netc_rcv() callback in tag_netc.c as requested.
> - Collect ACKs.
> - Retag for net-next.
> - Link to v1: https://patch.msgid.link/20260616-dsa-fix-free-skb-v1-1-fd30b35dcf66@kernel.org
> ---

From my perspective, the tradeoff between pros and cons is not so well
explained. Consider the following not mentioned in the commit message:

- Changing the kfree_skb() convention, without any mechanical obstacle
  preventing the backporting of patches that are written assuming one
  convention down to trees expecting the other (obstacle like a failure
  to compile, for example, which would warn people of their otherwise
  silent incompatibility), is an avoidable experience (at best) from a
  maintenance perspective.

- Has anyone proven that a real problem exists? Because dsa_user_xmit()
  -> skb_ensure_writable_head_tail() has run successfully at this stage,
  so we know that dev->needed_headroom bytes are available for writing.
  Because DSA uses VLAN as a tag, dsa_user_setup_tagger() will increase
  dev->needed_headroom by VLAN_HLEN for the tag_8021q protocols, so
  vlan_insert_tag() should not fail. I've looked at this function and it
  seems not to be coded up to fail for any other reason.

Otherwise, sure, it seems cleaner this way, but the way I see it, it
risks introducing more issues than it fixes. If maintainers feel
differently about this, please go ahead, but given the fact that I don't
really have a lot of time to do proper review during this period, I'm
more on the pragmatic side on this one.
