Netdev List


* Re: [PATCH v2 net 0/2] net: enetc: fix command BD ring issues
From: patchwork-bot+netdevbpf @ 2026-04-17  2:40 UTC (permalink / raw)
  To: Wei Fang
  Cc: claudiu.manoil, vladimir.oltean, xiaoning.wang, andrew+netdev,
	davem, edumazet, kuba, pabeni, chleroy, netdev, linux-kernel, imx,
	linuxppc-dev, linux-arm-kernel
In-Reply-To: <20260415060833.2303846-1-wei.fang@nxp.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 15 Apr 2026 14:08:31 +0800 you wrote:
> Currently, the implementation of the command BD ring has two issues. The
> first is that the driver may obtain a wrong consumer index for the ring:
> it does not mask out the SBE bit of the CIR value, so a wrong index is
> obtained when an SBE error occurs. The second is that the DMA buffer may
> be used after it is freed. If netc_xmit_ntmp_cmd() times out and returns
> an error, the pending command is not explicitly aborted, while
> ntmp_free_data_mem() unconditionally frees the DMA buffer. If the buffer
> has already been reallocated elsewhere, this may lead to silent memory
> corruption, because the hardware eventually processes the pending command
> and performs a DMA write of the response to the physical address of the
> freed buffer. This patch set fixes both issues.
> 
> [...]
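
To illustrate the first issue, here is a minimal sketch in C of masking a
status bit out of a register value before using it as a ring index. The bit
positions and the names CIR_INDEX_MASK, CIR_SBE and cbdr_consumer_index are
assumptions made for illustration; they are not taken from the enetc driver
or from the hardware manual.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the low bits hold the index, one high bit flags an error. */
#define CIR_INDEX_MASK	0x3ffu		/* hypothetical index field */
#define CIR_SBE		(1u << 31)	/* hypothetical error flag */

/* Return only the index part of a raw consumer index register value. */
static inline uint32_t cbdr_consumer_index(uint32_t cir)
{
	return cir & CIR_INDEX_MASK;
}
```

Without the mask, a raw value with the error flag set would be used as the
index directly and would point far outside the ring.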

Here is the summary with links:
  - [v2,net,1/2] net: enetc: correct the command BD ring consumer index
    https://git.kernel.org/netdev/net/c/759a32900b6f
  - [v2,net,2/2] net: enetc: fix NTMP DMA use-after-free issue
    https://git.kernel.org/netdev/net/c/3cade698881e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net v4 0/3] vsock/virtio: fix MSG_PEEK calculation on bytes to copy
From: patchwork-bot+netdevbpf @ 2026-04-17  2:40 UTC (permalink / raw)
  To: Luigi Leonardi
  Cc: stefanha, sgarzare, mst, jasowang, xuanzhuo, eperezma, davem,
	edumazet, kuba, pabeni, horms, avkrasnov, kvm, virtualization,
	netdev, linux-kernel
In-Reply-To: <20260415-fix_peek-v4-0-8207e872759e@redhat.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 15 Apr 2026 17:09:27 +0200 you wrote:
> `virtio_transport_stream_do_peek`, when calculating the number of bytes to
> copy, didn't consider the `offset` left by earlier partial reads.
> This might cause an out-of-bounds read that leads to an EFAULT.
> More details in the commits.
> 
> Commit 1 introduces the fix
> Commit 2 introduces some preliminary work for adding a test and fixes a
> problem in existing tests.
> Commit 3 introduces a test that checks for this bug to avoid future
> regressions.
> 
> [...]
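
A minimal sketch of the calculation described above, assuming hypothetical
names (peek_bytes_to_copy, skb_len, offset, remaining). It only models the
arithmetic and is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of bytes a peek may copy from one buffer.
 * skb_len:   total payload length of the buffer
 * offset:    bytes already consumed by earlier partial reads
 * remaining: bytes the caller still wants
 */
static size_t peek_bytes_to_copy(size_t skb_len, size_t offset,
				 size_t remaining)
{
	/* Only the part after the offset is still available. */
	size_t avail = offset < skb_len ? skb_len - offset : 0;

	return remaining < avail ? remaining : avail;
}
```

Ignoring the offset would amount to using skb_len in place of avail, which
lets the copy run past the end of the buffer after a partial read.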

Here is the summary with links:
  - [net,v4,1/3] vsock/virtio: fix MSG_PEEK ignoring skb offset when calculating bytes to copy
    https://git.kernel.org/netdev/net/c/080f22f5d302
  - [net,v4,2/3] vsock/test: fix MSG_PEEK handling in recv_buf()
    https://git.kernel.org/netdev/net/c/a3f77afbf67d
  - [net,v4,3/3] vsock/test: add MSG_PEEK after partial recv test
    https://git.kernel.org/netdev/net/c/2a2675ef6190

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH] net: ipv4: igmp: add sysctl option to ignore inbound llm_reports
From: Jakub Kicinski @ 2026-04-17  2:33 UTC (permalink / raw)
  To: Steffen Trumtrar
  Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Jonathan Corbet, Shuah Khan, David Ahern, netdev, linux-doc,
	linux-kernel, Ido Schimmel
In-Reply-To: <20260415-v7-0-topic-igmp-llm-drop-v1-1-1367bfbb898e@pengutronix.de>

On Wed, 15 Apr 2026 12:26:13 +0200 Steffen Trumtrar wrote:
> Add a new sysctl option 'igmp_link_local_mcast_reports_drop' that allows
> dropping inbound IGMP reports for link-local multicast groups in the
> 224.0.0.X range. This can be used to prevent the local system from
> processing IGMP reports for link-local multicast groups, so that the
> kernel keeps sending its own outbound IGMP reports.

+Ido to CC

I'm not sure what is reasonable here and what should be a firewall rule.
Either way:

## Form letter - net-next-closed

We have already submitted our pull request with net-next material for
v7.1, and therefore net-next is closed for new drivers, features, code
refactoring and optimizations. We are currently accepting bug fixes
only.

Please repost when net-next reopens after Apr 27th.

RFC patches sent for review only are obviously welcome at any time.

See:
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle

>  Documentation/networking/ip-sysctl.rst                       | 12 ++++++++++++
>  .../networking/net_cachelines/netns_ipv4_sysctl.rst          |  1 +
>  include/net/netns/ipv4.h                                     |  1 +
>  net/ipv4/af_inet.c                                           |  1 +
>  net/ipv4/igmp.c                                              |  2 ++
>  net/ipv4/sysctl_net_ipv4.c                                   |  7 +++++++
>  6 files changed, 24 insertions(+)
> 
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 6921d8594b849..2da4cd6ac7202 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -2306,6 +2306,18 @@ igmp_link_local_mcast_reports - BOOLEAN
>  
>  	Default TRUE
>  
> +igmp_link_local_mcast_reports_drop - BOOLEAN
> +	Drop inbound IGMP reports for link local multicast groups in
> +	the 224.0.0.X range. When enabled, IGMP membership reports for
> +	link local multicast addresses are silently dropped without
> +	processing.
> +	When the kernel receives inbound IGMP reports it stops sending its
> +	own IGMP reports. When inbound reports are dropped instead of
> +	processed, the kernel keeps sending its own reports, even when IGMP
> +	reports from other hosts are seen on the network.
> +
> +	Default FALSE
> +
>  Alexey Kuznetsov.
>  kuznet@ms2.inr.ac.ru
>  
> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> index beaf1880a19bf..703afe2ba063b 100644
> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> @@ -140,6 +140,7 @@ int                             sysctl_udp_rmem_min
>  u8                              sysctl_fib_notify_on_flag_change
>  u8                              sysctl_udp_l3mdev_accept
>  u8                              sysctl_igmp_llm_reports
> +u8                              sysctl_igmp_llm_reports_drop
>  int                             sysctl_igmp_max_memberships
>  int                             sysctl_igmp_max_msf
>  int                             sysctl_igmp_qrv
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 8e971c7bf1646..1453f825ffd4d 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -258,6 +258,7 @@ struct netns_ipv4 {
>  	u8 sysctl_igmp_llm_reports;
>  	int sysctl_igmp_max_memberships;
>  	int sysctl_igmp_max_msf;
> +	u8 sysctl_igmp_llm_reports_drop;
>  	int sysctl_igmp_qrv;
>  
>  	struct ping_group_range ping_group_range;
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index c7731e300a442..b8f96a5d8afdc 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1825,6 +1825,7 @@ static __net_init int inet_init_net(struct net *net)
>  	net->ipv4.sysctl_igmp_max_msf = 10;
>  	/* IGMP reports for link-local multicast groups are enabled by default */
>  	net->ipv4.sysctl_igmp_llm_reports = 1;
> +	net->ipv4.sysctl_igmp_llm_reports_drop = 0;
>  	net->ipv4.sysctl_igmp_qrv = 2;
>  
>  	net->ipv4.sysctl_fib_notify_on_flag_change = 0;
> diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
> index a674fb44ec25b..3a4932e4108bd 100644
> --- a/net/ipv4/igmp.c
> +++ b/net/ipv4/igmp.c
> @@ -931,6 +931,8 @@ static bool igmp_heard_report(struct in_device *in_dev, __be32 group)
>  	if (ipv4_is_local_multicast(group) &&
>  	    !READ_ONCE(net->ipv4.sysctl_igmp_llm_reports))
>  		return false;
> +	if (READ_ONCE(net->ipv4.sysctl_igmp_llm_reports_drop))
> +		return true;
>  
>  	rcu_read_lock();
>  	for_each_pmc_rcu(in_dev, im) {
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 5654cc9c8a0b9..24dde84d289e4 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -948,6 +948,13 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dou8vec_minmax,
>  	},
> +	{
> +		.procname	= "igmp_link_local_mcast_reports_drop",
> +		.data		= &init_net.ipv4.sysctl_igmp_llm_reports_drop,
> +		.maxlen		= sizeof(u8),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dou8vec_minmax,
> +	},
>  	{
>  		.procname	= "igmp_max_memberships",
>  		.data		= &init_net.ipv4.sysctl_igmp_max_memberships,
> 
> ---
> base-commit: 028ef9c96e96197026887c0f092424679298aae8
> change-id: 20260415-v7-0-topic-igmp-llm-drop-e4c13dbf17cc
> 
> Best regards,
> --  
> Steffen Trumtrar <s.trumtrar@pengutronix.de>
> 
-- 
pw-bot: defer
pv-bot: closed




* [PATCH 2/2] bpf: guard sock_ops rtt_min against non-locked tcp_sock
From: Werner Kasselman @ 2026-04-17  2:31 UTC (permalink / raw)
  To: bpf@vger.kernel.org, netdev@vger.kernel.org
  Cc: andrii@kernel.org, ast@kernel.org, brakmo@fb.com,
	daniel@iogearbox.net, davem@davemloft.net, eddyz87@gmail.com,
	edumazet@google.com, haoluo@google.com, horms@kernel.org,
	john.fastabend@gmail.com, jolsa@kernel.org, kpsingh@kernel.org,
	kuba@kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, martin.lau@linux.dev,
	pabeni@redhat.com, sdf@fomichev.me, shuah@kernel.org,
	song@kernel.org, yonghong.song@linux.dev, jiayuan.chen@linux.dev,
	Werner Kasselman, stable@vger.kernel.org
In-Reply-To: <20260417023119.3830723-1-werner@verivus.com>

sock_ops_convert_ctx_access() reads rtt_min without the
is_locked_tcp_sock guard used for every other tcp_sock field. On
request_sock-backed sock_ops callbacks, sk points at a
tcp_request_sock and the converted load reads past the end of the
allocation.

Reuse SOCK_OPS_LOAD_TCP_SOCK_FIELD() for the rtt_min access and compute
the offset with offsetof(struct minmax_sample, v). This leaves the byte
addressed unchanged from the old sizeof_field(struct minmax_sample, t)
expression, while making rtt_min consistent with every other tcp_sock
field.
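
The offset equivalence claimed above can be checked with a small standalone
sketch. The struct below is a local mirror with two 32-bit members and is an
assumption for illustration; sizeof_field is redefined locally and
rtt_min_value_offset is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of a two-member sample; layout assumed for illustration. */
struct minmax_sample {
	uint32_t t;	/* time the measurement was taken */
	uint32_t v;	/* value measured */
};

#define sizeof_field(TYPE, MEMBER) sizeof(((TYPE *)0)->MEMBER)

/* With 't' first and no padding, 'v' starts right after 't'. */
_Static_assert(offsetof(struct minmax_sample, v) ==
	       sizeof_field(struct minmax_sample, t),
	       "offsetof(v) and sizeof_field(t) address the same byte");

/* Offset of the value member, given the offset of the enclosing sample. */
static size_t rtt_min_value_offset(size_t rtt_min_off)
{
	return rtt_min_off + offsetof(struct minmax_sample, v);
}
```

The offsetof() form states the intent (read the 'v' member) directly, whereas
the sizeof_field() form only happens to land on the same byte.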

This also picks up the same dst_reg == src_reg handling used by the
other guarded field loads. Extend the sock_ops_get_sk selftest with an
rtt_min subtest that checks request_sock-backed !fullsock callbacks read
zero instead of leaking request_sock-adjacent memory.

Found via AST-based call-graph analysis using sqry.

Fixes: 44f0e43037d3 ("bpf: Add support for reading sk_state and more")
Cc: stable@vger.kernel.org
Signed-off-by: Werner Kasselman <werner@verivus.com>
---
 net/core/filter.c                             | 12 +++----
 .../bpf/prog_tests/sock_ops_get_sk.c          |  9 ++++++
 .../selftests/bpf/progs/sock_ops_get_sk.c     | 31 +++++++++++++++++++
 3 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e8ad062f63bc..9c43193a5c39 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10827,14 +10827,12 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 			     sizeof(struct minmax));
 		BUILD_BUG_ON(sizeof(struct minmax) <
 			     sizeof(struct minmax_sample));
+		BUILD_BUG_ON(offsetof(struct tcp_sock, rtt_min) +
+			     offsetof(struct minmax_sample, v) > S16_MAX);
 
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
-						struct bpf_sock_ops_kern, sk),
-				      si->dst_reg, si->src_reg,
-				      offsetof(struct bpf_sock_ops_kern, sk));
-		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
-				      offsetof(struct tcp_sock, rtt_min) +
-				      sizeof_field(struct minmax_sample, t));
+		off = offsetof(struct tcp_sock, rtt_min) +
+		      offsetof(struct minmax_sample, v);
+		SOCK_OPS_LOAD_TCP_SOCK_FIELD(BPF_W, off);
 		break;
 
 	case offsetof(struct bpf_sock_ops, bpf_sock_ops_cb_flags):
diff --git a/tools/testing/selftests/bpf/prog_tests/sock_ops_get_sk.c b/tools/testing/selftests/bpf/prog_tests/sock_ops_get_sk.c
index 343d92c4df30..1aea4c97d5d3 100644
--- a/tools/testing/selftests/bpf/prog_tests/sock_ops_get_sk.c
+++ b/tools/testing/selftests/bpf/prog_tests/sock_ops_get_sk.c
@@ -70,6 +70,15 @@ void test_ns_sock_ops_get_sk(void)
 		ASSERT_EQ(skel->bss->diff_reg_bug_detected, 0, "diff_reg_bug_not_detected");
 	}
 
+	/* Test sock_ops rtt_min access in !fullsock callbacks */
+	if (test__start_subtest("get_rtt_min")) {
+		run_sock_ops_test(cgroup_fd,
+				  bpf_program__fd(skel->progs.sock_ops_get_rtt_min));
+		ASSERT_EQ(skel->bss->rtt_min_null_seen, 1, "rtt_min_null_seen");
+		ASSERT_EQ(skel->bss->rtt_min_bug_detected, 0,
+			  "rtt_min_bug_not_detected");
+	}
+
 	sock_ops_get_sk__destroy(skel);
 close_cgroup:
 	close(cgroup_fd);
diff --git a/tools/testing/selftests/bpf/progs/sock_ops_get_sk.c b/tools/testing/selftests/bpf/progs/sock_ops_get_sk.c
index 3a0689f8ce7c..dee07da8901e 100644
--- a/tools/testing/selftests/bpf/progs/sock_ops_get_sk.c
+++ b/tools/testing/selftests/bpf/progs/sock_ops_get_sk.c
@@ -114,4 +114,35 @@ __naked void sock_ops_get_sk_diff_reg(void)
 		: __clobber_all);
 }
 
+/* sock_ops rtt_min access: different-register, is_locked_tcp_sock == 0 path (TCP_NEW_SYN_RECV). */
+int rtt_min_bug_detected;
+int rtt_min_null_seen;
+
+SEC("sockops")
+__naked void sock_ops_get_rtt_min(void)
+{
+	asm volatile (
+		"r7 = *(u32 *)(r1 + %[is_fullsock_off]);"
+		"r2 = *(u32 *)(r1 + %[rtt_min_off]);"
+		"if r7 != 0 goto 2f;"
+		"if r2 == 0 goto 1f;"
+		"r1 = %[rtt_min_bug_detected] ll;"
+		"r3 = 1;"
+		"*(u32 *)(r1 + 0) = r3;"
+		"goto 2f;"
+	"1:"
+		"r1 = %[rtt_min_null_seen] ll;"
+		"r3 = 1;"
+		"*(u32 *)(r1 + 0) = r3;"
+	"2:"
+		"r0 = 1;"
+		"exit;"
+		:
+		: __imm_const(is_fullsock_off, offsetof(struct bpf_sock_ops, is_fullsock)),
+		  __imm_const(rtt_min_off, offsetof(struct bpf_sock_ops, rtt_min)),
+		  __imm_addr(rtt_min_bug_detected),
+		  __imm_addr(rtt_min_null_seen)
+		: __clobber_all);
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.43.0



* [PATCH 1/2] bpf: extract SOCK_OPS_LOAD_TCP_SOCK_FIELD from SOCK_OPS_GET_FIELD
From: Werner Kasselman @ 2026-04-17  2:31 UTC (permalink / raw)
  To: bpf@vger.kernel.org, netdev@vger.kernel.org
  Cc: andrii@kernel.org, ast@kernel.org, brakmo@fb.com,
	daniel@iogearbox.net, davem@davemloft.net, eddyz87@gmail.com,
	edumazet@google.com, haoluo@google.com, horms@kernel.org,
	john.fastabend@gmail.com, jolsa@kernel.org, kpsingh@kernel.org,
	kuba@kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, martin.lau@linux.dev,
	pabeni@redhat.com, sdf@fomichev.me, shuah@kernel.org,
	song@kernel.org, yonghong.song@linux.dev, jiayuan.chen@linux.dev,
	Werner Kasselman
In-Reply-To: <20260417023119.3830723-1-werner@verivus.com>

Extract the guarded sock_ops field load sequence out of
SOCK_OPS_GET_FIELD() into SOCK_OPS_LOAD_TCP_SOCK_FIELD(FIELD_SIZE,
FIELD_OFFSET) so callers that need a custom field offset can reuse the
same is_locked_tcp_sock guard and dst_reg == src_reg handling.

This is needed for rtt_min, which reads a minmax sub-field rather than a
direct struct member.

No functional change.

Signed-off-by: Werner Kasselman <werner@verivus.com>
---
 net/core/filter.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index fcfcb72663ca..e8ad062f63bc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10535,12 +10535,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 	struct bpf_insn *insn = insn_buf;
 	int off;
 
-/* Helper macro for adding read access to tcp_sock or sock fields. */
-#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+/* Helper macro for adding guarded read access to tcp_sock fields. */
+#define SOCK_OPS_LOAD_TCP_SOCK_FIELD(FIELD_SIZE, FIELD_OFFSET)		      \
 	do {								      \
 		int fullsock_reg = si->dst_reg, reg = BPF_REG_9, jmp = 2;     \
-		BUILD_BUG_ON(sizeof_field(OBJ, OBJ_FIELD) >		      \
-			     sizeof_field(struct bpf_sock_ops, BPF_FIELD));   \
 		if (si->dst_reg == reg || si->src_reg == reg)		      \
 			reg--;						      \
 		if (si->dst_reg == reg || si->src_reg == reg)		      \
@@ -10567,10 +10565,9 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 						struct bpf_sock_ops_kern, sk),\
 				      si->dst_reg, si->src_reg,		      \
 				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
-						       OBJ_FIELD),	      \
+		*insn++ = BPF_LDX_MEM(FIELD_SIZE,			      \
 				      si->dst_reg, si->dst_reg,		      \
-				      offsetof(OBJ, OBJ_FIELD));	      \
+				      FIELD_OFFSET);			      \
 		if (si->dst_reg == si->src_reg)	{			      \
 			*insn++ = BPF_JMP_A(2);				      \
 			*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg,	      \
@@ -10580,6 +10577,14 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 		}							      \
 	} while (0)
 
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		BUILD_BUG_ON(sizeof_field(OBJ, OBJ_FIELD) >		      \
+			     sizeof_field(struct bpf_sock_ops, BPF_FIELD));   \
+		SOCK_OPS_LOAD_TCP_SOCK_FIELD(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),\
+					     offsetof(OBJ, OBJ_FIELD));       \
+	} while (0)
+
 #define SOCK_OPS_GET_SK()							      \
 	do {								      \
 		int fullsock_reg = si->dst_reg, reg = BPF_REG_9, jmp = 1;     \
-- 
2.43.0



* [PATCH bpf v3 0/2] bpf: fix sock_ops rtt_min OOB read
From: Werner Kasselman @ 2026-04-17  2:31 UTC (permalink / raw)
  To: bpf@vger.kernel.org, netdev@vger.kernel.org
  Cc: andrii@kernel.org, ast@kernel.org, brakmo@fb.com,
	daniel@iogearbox.net, davem@davemloft.net, eddyz87@gmail.com,
	edumazet@google.com, haoluo@google.com, horms@kernel.org,
	john.fastabend@gmail.com, jolsa@kernel.org, kpsingh@kernel.org,
	kuba@kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, martin.lau@linux.dev,
	pabeni@redhat.com, sdf@fomichev.me, shuah@kernel.org,
	song@kernel.org, yonghong.song@linux.dev, jiayuan.chen@linux.dev,
	Werner Kasselman

Patch 1 from v2 is upstream as 10f86a2a5c91 ("bpf: Fix same-register
dst/src OOB read and pointer leak in sock_ops"), so this reroll keeps
only the remaining work for ctx->rtt_min.

Patch 1 extracts the guarded field-load sequence out of
SOCK_OPS_GET_FIELD() so the rtt_min sub-field access can reuse the same
is_locked_tcp_sock guard and dst_reg == src_reg handling without
open-coding it again.
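
As a rough userspace model of what a guarded field load means, the sketch
below returns zero unless a flag says the object really is the expected type.
The function name guarded_load_u32 and its parameters are hypothetical; the
real helper emits BPF instructions rather than performing the load itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Load a 32-bit field at byte offset 'off' from 'sk', but only when the
 * caller has established that 'sk' is a locked tcp_sock. Otherwise yield 0
 * instead of reading memory that may not belong to the object.
 */
static uint32_t guarded_load_u32(const void *sk, int is_locked_tcp_sock,
				 size_t off)
{
	uint32_t val = 0;

	if (is_locked_tcp_sock && sk)
		memcpy(&val, (const char *)sk + off, sizeof(val));
	return val;
}
```

Taking the offset as a plain parameter is what lets a sub-field access reuse
the same guard as a direct member access.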

Patch 2 uses that helper for rtt_min and extends the landed
sock_ops_get_sk selftest with an rtt_min subtest covering the
request_sock-backed !fullsock path.

Changes since v2:
- drop the overlapping dst_reg == src_reg fix, now upstream as
  10f86a2a5c91
- keep only the helper extraction plus the rtt_min fix
- add an rtt_min subtest on top of the landed sock_ops_get_sk selftest

Werner Kasselman (2):
  bpf: extract SOCK_OPS_LOAD_TCP_SOCK_FIELD from SOCK_OPS_GET_FIELD
  bpf: guard sock_ops rtt_min against non-locked tcp_sock

 net/core/filter.c                             | 31 ++++++++++---------
 .../bpf/prog_tests/sock_ops_get_sk.c          |  9 ++++++
 .../selftests/bpf/progs/sock_ops_get_sk.c     | 31 +++++++++++++++++++
 3 files changed, 57 insertions(+), 14 deletions(-)

-- 
2.43.0


* Re: [PATCH net v2 1/2] bnge: fix initial HWRM sequence
From: Jakub Kicinski @ 2026-04-17  2:22 UTC (permalink / raw)
  To: vikas.gupta
  Cc: Jakub Kicinski, davem, edumazet, pabeni, andrew+netdev, horms,
	netdev, linux-kernel, vsrama-krishna.nemani, bhargava.marreddy,
	rajashekar.hudumula, ajit.khaparde, dharmender.garg,
	rahul-rg.gupta
In-Reply-To: <20260415151621.1104956-2-vikas.gupta@broadcom.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
bnge: fix initial HWRM sequence

This commit modifies the initial HWRM sequence to allocate the context
memory before querying resource information. This addresses an issue where
firmware might not advertise correct resources.

> diff --git a/drivers/net/ethernet/broadcom/bnge/bnge_core.c b/drivers/net/ethernet/broadcom/bnge/bnge_core.c
> --- a/drivers/net/ethernet/broadcom/bnge/bnge_core.c
> +++ b/drivers/net/ethernet/broadcom/bnge/bnge_core.c
> @@ -73,25 +73,35 @@ static int bnge_func_qcaps(struct bnge_dev *bd)

[ ... ]

>  	return 0;
> +
> +err_free_ctx_mem:
> +	bnge_free_ctx_mem(bd);
> +	return rc;
>  }

Could this error path leave the firmware with active DMA pointers to freed
memory?

Looking at bnge_alloc_ctx_mem(), it configures the firmware with host DMA
addresses for the backing store via bnge_backing_store_cfg(). If a
subsequent step like bnge_hwrm_func_resc_qcaps() fails, this new error path
directly frees the context memory.

Previously, errors in this sequence triggered bnge_fw_unregister_dev(),
which safely unregistered the firmware before freeing the memory. There is
even a comment in bnge_fw_unregister_dev() documenting this requirement:

static void bnge_fw_unregister_dev(struct bnge_dev *bd)
{
        /* ctx mem free after unrgtr only */
        bnge_hwrm_func_drv_unrgtr(bd);
        bnge_free_ctx_mem(bd);
}

Since the firmware isn't told to stop using the memory before it is freed
here, is there a risk the hardware might DMA into the freed host pages?

> @@ -132,32 +142,25 @@ static int bnge_fw_register_dev(struct bnge_dev *bd)

[ ... ]

> -err_func_unrgtr:
> -	bnge_fw_unregister_dev(bd);
> +err_free_ctx_mem:
> +	bnge_free_ctx_mem(bd);
>  	return rc;
>  }

This error path appears to have the same regression. If
bnge_hwrm_func_drv_rgtr() fails after bnge_func_qcaps() has already
configured the backing store, freeing the context memory directly without
unregistering might allow the hardware to access freed memory.
-- 
pw-bot: cr


* Re: [PATCH net v2] net: airoha: Wait for NPU PPE configuration to complete in airoha_ppe_offload_setup()
From: patchwork-bot+netdevbpf @ 2026-04-17  2:10 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-arm-kernel,
	linux-mediatek, netdev
In-Reply-To: <20260414-airoha-wait-for-npu-config-offload-setup-v2-1-5a9bf6d43aee@kernel.org>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 14 Apr 2026 16:08:52 +0200 you wrote:
> In order to properly enable flowtable hw offloading, poll
> REG_PPE_FLOW_CFG register in airoha_ppe_offload_setup routine and
> wait for NPU PPE configuration triggered by ppe_init callback to complete
> before running airoha_ppe_hw_init().
> 
> Fixes: 00a7678310fe3 ("net: airoha: Introduce flowtable offload support")
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> 
> [...]

Here is the summary with links:
  - [net,v2] net: airoha: Wait for NPU PPE configuration to complete in airoha_ppe_offload_setup()
    https://git.kernel.org/netdev/net/c/f3206328bb52

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net] net/sched: taprio: fix use-after-free in advance_sched() on schedule switch
From: patchwork-bot+netdevbpf @ 2026-04-17  2:10 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: vladimir.oltean, jhs, jiri, davem, edumazet, kuba, pabeni, horms,
	netdev, linux-kernel, qjx1298677004
In-Reply-To: <20260410-taprio-user-after-free-fix-net-v1-1-0931b1579988@intel.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 10 Apr 2026 18:57:57 -0700 you wrote:
> In advance_sched(), when should_change_schedules() returns true,
> switch_schedules() is called to promote the admin schedule to oper.
> switch_schedules() queues the old oper schedule for RCU freeing via
> call_rcu(), but 'next' still points into an entry of the old oper
> schedule. The subsequent 'next->end_time = end_time' and
> rcu_assign_pointer(q->current_entry, next) are use-after-free.
> 
> [...]

Here is the summary with links:
  - [net] net/sched: taprio: fix use-after-free in advance_sched() on schedule switch
    https://git.kernel.org/netdev/net/c/105425b1969c

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH] net: mdio: MDIO_PIC64HPSC should depend on ARCH_MICROCHIP
From: patchwork-bot+netdevbpf @ 2026-04-17  2:10 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: charles.perry, conor, kuba, maxime.chevallier, andrew, hkallweit1,
	linux, davem, edumazet, pabeni, netdev, linux-kernel
In-Reply-To: <980c57efa5843733ef95459c3283aebade56f142.1776162544.git.geert+renesas@glider.be>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 14 Apr 2026 12:30:47 +0200 you wrote:
> The PIC64-HPSC/HX MDIO interface is only present on Microchip
> PIC64-HPSC/HX SoCs.  Hence add a dependency on ARCH_MICROCHIP, to
> prevent asking the user about this driver when configuring a kernel
> without Microchip SoC support.
> 
> Fixes: f76aef980206e7c6 ("net: mdio: add a driver for PIC64-HPSC/HX MDIO controller")
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> 
> [...]

Here is the summary with links:
  - net: mdio: MDIO_PIC64HPSC should depend on ARCH_MICROCHIP
    https://git.kernel.org/netdev/net/c/1e9e7fd839b7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next 0/2] net: mana: Avoid queue struct allocation failure under memory fragmentation
From: Jakub Kicinski @ 2026-04-17  2:08 UTC (permalink / raw)
  To: Aditya Garg
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya
In-Reply-To: <20260414151456.687506-1-gargaditya@linux.microsoft.com>

On Tue, 14 Apr 2026 08:13:28 -0700 Aditya Garg wrote:
> The MANA driver can fail to load on systems with high memory
> utilization because several allocations in the queue setup paths
> require large physically contiguous blocks via kmalloc. Under memory
> fragmentation these high-order allocations may fail, preventing the
> driver from creating queues at probe time or when reconfiguring
> channels, ring parameters or MTU at runtime.

## Form letter - net-next-closed

We have already submitted our pull request with net-next material for v7.1,
and therefore net-next is closed for new drivers, features, code refactoring
and optimizations. We are currently accepting bug fixes only.

Please repost when net-next reopens after Apr 27th.

RFC patches sent for review only are obviously welcome at any time.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
-- 
pw-bot: defer
pv-bot: closed


* Re: [PATCH net 1/1] mptcp: hold subflow request owners when cloning reqsk
From: Kuniyuki Iwashima @ 2026-04-17  1:58 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: Ren Wei, netdev, mptcp, davem, edumazet, kuba, pabeni, horms,
	ncardwell, dsahern, martineau, geliang, daniel, kafai, yuantan098,
	yifanwucs, tomapufckgml, bird, caoruide123, enjou1224z
In-Reply-To: <cb1c7e8e-2d64-4a2d-9862-e5c4f544144e@kernel.org>

On Thu, Apr 16, 2026 at 2:18 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Kuniyuki,
>
> Thank you for your reply!
>
> 16 Apr 2026 20:48:58 Kuniyuki Iwashima <kuniyu@google.com>:
>
> > On Thu, Apr 16, 2026 at 10:45 AM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>
> >> Hi Ren,
> >>
> >> On 15/04/2026 11:31, Ren Wei wrote:
> >>> From: Ruide Cao <caoruide123@gmail.com>
> >>>
> >>> TCP request migration clones pending request sockets with
> >>> inet_reqsk_clone(). For MPTCP MP_JOIN requests this raw-copies
> >>> subflow_req->msk, but the cloned request does not take a new reference.
> >>>
> >>> Both the original and the cloned request can later drop the same msk in
> >>> subflow_req_destructor(), and a migrated request may keep a dangling msk
> >>> pointer after the original owner has already been released.
> >>>
> >>> Add a request_sock clone callback and let MPTCP grab a reference for cloned
> >>> subflow requests that carry an msk. This keeps ownership balanced across
> >>> both successful migrations and failed clone/insert paths without changing
> >>> other protocols.
>
> (...)
>
> >>> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> >>> index e961936b6be7..140a9e96ad58 100644
> >>> --- a/net/ipv4/inet_connection_sock.c
> >>> +++ b/net/ipv4/inet_connection_sock.c
> >>> @@ -954,6 +954,9 @@ static struct request_sock *inet_reqsk_clone(struct request_sock *req,
> >>>       if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(nreq)->tfo_listener)
> >>>               rcu_assign_pointer(tcp_sk(nreq->sk)->fastopen_rsk, nreq);
> >>
> >> (Maybe TCP with fastopen could be this other user to call
> >> rcu_assign_pointer()? (net-next material))
> >>
> >>> +     if (req->rsk_ops->init_clone)
> >>> +             req->rsk_ops->init_clone(req, nreq);
> >
> > I think a simple direct call is better.
> >
> > #ifdef CONFIG_MPTCP
> >     if (tcp_rsk(req)->is_mptcp)
> >         mptcp_reqsk_clone(nreq);
> > #endif
>
> Fine by me!
>
> I guess the protocol needs to be checked, similar to what is done with
> TFO above:
>
>   if (sk->sk_protocol == IPPROTO_TCP) {

I don't remember why I added this check for TFO,
but I think this is not needed.  The migration feature was
not supported by DCCP and it's already removed anyway.
I'll remove the check once net-next opens.


>       if TFO
>           ...
>       if MPTCP (+ifdef)
>           ...
>   }
>
> Cheers,
> Matt
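
The ownership rule discussed in this thread can be modelled in a few lines
of plain C. All names here (msk_model, subflow_req_model, req_clone,
req_destroy) are hypothetical and only illustrate the reference counting;
they are not the MPTCP structures or functions.

```c
#include <assert.h>
#include <stddef.h>

struct msk_model {
	int refcnt;			/* number of owners */
};

struct subflow_req_model {
	struct msk_model *msk;		/* may be NULL */
};

/* Clone a request: the copy shares the msk, so it must take a reference. */
static void req_clone(struct subflow_req_model *dst,
		      const struct subflow_req_model *src)
{
	*dst = *src;
	if (dst->msk)
		dst->msk->refcnt++;
}

/* Destroy a request: drop the reference it owns, if any. */
static void req_destroy(struct subflow_req_model *req)
{
	if (req->msk) {
		req->msk->refcnt--;
		req->msk = NULL;
	}
}
```

If req_clone() only copied the pointer, destroying both the original and the
clone would drop the same reference twice.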


* Re: [PATCH bpf v2 1/2] bpf: Reject TCP_NODELAY in TCP header option callbacks
From: KaFai Wan @ 2026-04-17  1:35 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: daniel, john.fastabend, sdf, ast, andrii, eddyz87, memxor, song,
	yonghong.song, jolsa, davem, edumazet, kuba, pabeni, horms, shuah,
	jiayuan.chen, bpf, netdev, linux-kernel, linux-kselftest,
	Quan Sun, Yinhao Hu, Kaiyan Mei
In-Reply-To: <2026416172759.GIuu.martin.lau@linux.dev>

On Thu, 2026-04-16 at 10:35 -0700, Martin KaFai Lau wrote:
> On Thu, Apr 16, 2026 at 07:23:07PM +0800, KaFai Wan wrote:
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index fcfcb72663ca..911ff04bca5a 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5833,6 +5833,11 @@ BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> >  	if (!is_locked_tcp_sock_ops(bpf_sock))
> >  		return -EOPNOTSUPP;
> >  
> > +	if ((bpf_sock->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB ||
> > +	     bpf_sock->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB) &&
> > +	    IS_ENABLED(CONFIG_INET) && level == SOL_TCP && optname == TCP_NODELAY)
> 
> IS_ENABLED(CONFIG_INET) is unnecessary.

ok, will remove in next version.
> 
> pw-bot: cr
> 
> > +		return -EOPNOTSUPP;
> > +
> >  	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
> >  }
> >  
> > -- 
> > 2.43.0
> > 

-- 
Thanks,
KaFai
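
For reference, the condition under discussion (with the IS_ENABLED() term
dropped, as agreed above) can be written as a standalone predicate. The enum
and the MODEL_ constants are local stand-ins chosen for this sketch; the
numeric values 6 and 1 match SOL_TCP and TCP_NODELAY on Linux, while the
enum values are arbitrary and do not match the BPF UAPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the sock_ops callback types; values are arbitrary. */
enum model_op {
	MODEL_OP_OTHER,
	MODEL_OP_HDR_OPT_LEN_CB,
	MODEL_OP_WRITE_HDR_OPT_CB,
};

#define MODEL_SOL_TCP		6	/* SOL_TCP on Linux */
#define MODEL_TCP_NODELAY	1	/* TCP_NODELAY on Linux */

/* True when setsockopt must be refused for this callback and option. */
static bool reject_setsockopt(enum model_op op, int level, int optname)
{
	return (op == MODEL_OP_HDR_OPT_LEN_CB ||
		op == MODEL_OP_WRITE_HDR_OPT_CB) &&
	       level == MODEL_SOL_TCP && optname == MODEL_TCP_NODELAY;
}
```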


* Re: [PATCH for-7.1-fixes 1/2] rhashtable: add no_sync_grow option
From: Herbert Xu @ 2026-04-17  1:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Thomas Graf, David Vernet, Andrea Righi, Changwoo Min,
	Emil Tsalapatis, linux-crypto, sched-ext, linux-kernel,
	Florian Westphal, netdev
In-Reply-To: <20260417002449.2290577-1-tj@kernel.org>

On Thu, Apr 16, 2026 at 02:24:48PM -1000, Tejun Heo wrote:
> 
> The follow-up sched_ext patch is a fix targeting sched_ext/for-7.1-fixes
> which I'd like to send Linus's way sooner than later. Would it be okay
> to route both patches through sched_ext/for-7.1-fixes? If you'd prefer
> to route the rhashtable change differently, that works too. Please let
> me know, thanks.

As I said earlier, we should work out if this is really needed or
not.

But if it is needed, we did have this feature before.  It's called
insecure_elasticity which was removed because we moved all its users
off to better solutions.

We can always bring it back.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH v5 net-next 0/8] dpll/ice: Add TXC DPLL type and full TX reference clock control for E825
From: Jakub Kicinski @ 2026-04-17  1:04 UTC (permalink / raw)
  To: Kubalewski, Arkadiusz
  Cc: Vecera, Ivan, vadim.fedorenko@linux.dev, edumazet@google.com,
	netdev@vger.kernel.org, richardcochran@gmail.com,
	donald.hunter@gmail.com, linux-kernel@vger.kernel.org,
	davem@davemloft.net, Prathosh.Satish@microchip.com,
	andrew+netdev@lunn.ch, intel-wired-lan@lists.osuosl.org,
	horms@kernel.org, Kitszel, Przemyslaw, Nguyen, Anthony L,
	pabeni@redhat.com, jiri@resnulli.us
In-Reply-To: <IA0PR11MB7378FF7BF4EA32C1F89372C19B232@IA0PR11MB7378.namprd11.prod.outlook.com>

On Thu, 16 Apr 2026 18:26:11 +0000 Kubalewski, Arkadiusz wrote:
> >> This HW doesn't use EEC DPLL signal to feed MAC clock, as DPLL is
> >> external from NIC point of view. Only 2 signals from such external DPLL
> >> device are used by NIC:
> >> - synce (a single source for all those TXC per-port DPLL device)
> >> - time_ref (a source for the TS_PLL - which drives PTP timer)  
> >
> >No bypass? The PLL is actually in the loop? oof, this is beyond
> >my understanding of clocks and signals :S
>
> TBH, I am not entirely sure what you mean by putting the MAC PLL into
> bypass mode, but the HW description I have provided is still true: the
> MAC is not fed with any DPLL provided signal here. Only the port TX clock
> PLLs and a timer PLL can use those.

The ASIC PLL IPs I managed to find had a bypass mode where the reference
/ input frequency still goes thru the dividers but the PLL circuit is
bypassed. I assumed that if we want to distribute a syntonized clock
across the network we would want as few PLL circuits in the paths as
possible and we'd use bypass (which would be relevant here since for
the target use case we wouldn't engage the PLL of the TXC). But this 
is 100% guesswork so I'm probably speaking gibberish.

> >> Well, 'floating' MUX type pin not connected to any dpll would require a
> >> lot of additional implementations, just to allow source selection, as we
> >> have tried it already.
> >>
> >> Wouldn't more generic name cause a DPLL purpose problem?  
> >
> >The old proposal in netdev family was to have source selection
> >without creating a real mux. Not saying I'm dead set on that direction.
> 
> Yes, correct, it kept the list of dpll pins valid for source selection of
> tx clock within the netdev and control over it through RT netlink.
> That solution was rather simple but you requested to hack into dpll so we
> did here.
> 
> IMHO this is the cleanest and simplest solution we could find to keep it
> within DPLL subsystem.
> 
> >> We still want to make sure that given DPLL device would serve the role
> >> of source selection for particular port where a source pin should be an
> >> output either on EEC dpll or some external signal generator but somehow
> >> related to SyncE or similar solutions.  
> >
> >Right, but adding a new "type" per location of the PLL (especially if
> >we lean into covering any ASIC PLL) may not scale, and opens us up to
> >"vendor X calls it Y" and "in design A clock is fed by pll type X and
> >in design B by type Y".
> 
> I was thinking that this is more like a purpose specific DPLL device, if
> someone would want something similar we would have to review it, right?

We would if it was an Ethernet MAC PLL, but if someone wanted to expose
whether some random PLL in their ASIC locks - are we adding a new type
for each one of those?

> >IIUC you do provide "linking" of the pins? netdev will have the MAC pin
> >assigned. Is the pin that connects the PLLs also annotated so that user
> >knows what's on the "other side"? Maybe the topology would be clear
> >enough from just that, and we don't have to add a TXC type.
> >Call the PLL "integrated" or something generic. User should be able to
> >trace the path of the signals?  
> 
> It depends, TX clock has one of external pins connected to external DPLL,
> but second is a board-level pin with ability to provide some external
> clock signal, the user would have to determine that purpose just based
> on the topology of one of the pins, which seems a bit problematic?
> I.e. if at some point there would be HW with only external non-DPLL
> connected pins?

Not sure I follow, TBH. To me the function of the "MAC PLL" is fairly
obvious from the fact that it has a pin exposed via rtnetlink. So it's
obviously a DPLL which can drive the Tx clock?

It's the function / relation / linking to the EEC DPLL that may not
be obvious. But if the user can see how the pins connect, they can get
some LLM to draw a diagram of a live system.. et voila :)

> I mean a 'generic' type is something we could do, but as already mentioned,
> I thought that we want DPLL types specified/designed for some particular
> functions/tasks.

I feel like we often get labels wrong the first time around, so if we
can defer adding them until later that'd make me happy..

^ permalink raw reply

* [PATCH net v2 11/12] idpf: fix xdp crash in soft reset error path
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Emil Tantilov, stable, Aleksandr Loktionov,
	Patryk Holda
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Emil Tantilov <emil.s.tantilov@intel.com>

NULL pointer dereference is reported in cases where idpf_vport_open()
fails during soft reset:

./xdpsock -i <inf> -q -r -N

[ 3179.186687] idpf 0000:83:00.0: Failed to initialize queue ids for vport 0: -12
[ 3179.276739] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 3179.277636] #PF: supervisor read access in kernel mode
[ 3179.278470] #PF: error_code(0x0000) - not-present page
[ 3179.279285] PGD 0
[ 3179.280083] Oops: Oops: 0000 [#1] SMP NOPTI
...
[ 3179.283997] Workqueue: events xp_release_deferred
[ 3179.284770] RIP: 0010:idpf_find_rxq_vec+0x17/0x30 [idpf]
...
[ 3179.291937] Call Trace:
[ 3179.292392]  <TASK>
[ 3179.292843]  idpf_qp_switch+0x25/0x820 [idpf]
[ 3179.293325]  idpf_xsk_pool_setup+0x7c/0x520 [idpf]
[ 3179.293803]  idpf_xdp+0x59/0x240 [idpf]
[ 3179.294275]  xp_disable_drv_zc+0x62/0xb0
[ 3179.294743]  xp_clear_dev+0x40/0xb0
[ 3179.295198]  xp_release_deferred+0x1f/0xa0
[ 3179.295648]  process_one_work+0x226/0x730
[ 3179.296106]  worker_thread+0x19e/0x340
[ 3179.296557]  ? __pfx_worker_thread+0x10/0x10
[ 3179.297009]  kthread+0xf4/0x130
[ 3179.297459]  ? __pfx_kthread+0x10/0x10
[ 3179.297910]  ret_from_fork+0x32c/0x410
[ 3179.298361]  ? __pfx_kthread+0x10/0x10
[ 3179.298702]  ret_from_fork_asm+0x1a/0x30

Fix the error handling of the soft reset in idpf_xdp_setup_prog() by
restoring the vport->xdp_prog to the old value. This avoids referencing
the orphaned prog that was copied to vport->xdp_prog in the soft reset
and prevents a subsequent false positive from idpf_xdp_enabled().

Update the restart check in idpf_xsk_pool_setup() to use IDPF_VPORT_UP bit
instead of netif_running(). The idpf_vport_stop/start() calls will not
update the __LINK_STATE_START bit, making this test a false positive
should the soft reset fail.

Fixes: 3d57b2c00f09 ("idpf: add XSk pool initialization")
Cc: stable@vger.kernel.org
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/idpf/xdp.c | 1 +
 drivers/net/ethernet/intel/idpf/xsk.c | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
index cbccd4546768..18a6e7062863 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.c
+++ b/drivers/net/ethernet/intel/idpf/xdp.c
@@ -488,6 +488,7 @@ static int idpf_xdp_setup_prog(struct idpf_vport *vport,
 				   "Could not reopen the vport after XDP setup");
 
 		cfg->user_config.xdp_prog = old;
+		vport->xdp_prog = old;
 		old = prog;
 	}
 
diff --git a/drivers/net/ethernet/intel/idpf/xsk.c b/drivers/net/ethernet/intel/idpf/xsk.c
index d95d3efdfd36..3d8c430efd2b 100644
--- a/drivers/net/ethernet/intel/idpf/xsk.c
+++ b/drivers/net/ethernet/intel/idpf/xsk.c
@@ -553,6 +553,7 @@ int idpf_xskrq_poll(struct idpf_rx_queue *rxq, u32 budget)
 
 int idpf_xsk_pool_setup(struct idpf_vport *vport, struct netdev_bpf *bpf)
 {
+	const struct idpf_netdev_priv *np = netdev_priv(vport->netdev);
 	struct xsk_buff_pool *pool = bpf->xsk.pool;
 	u32 qid = bpf->xsk.queue_id;
 	bool restart;
@@ -568,7 +569,8 @@ int idpf_xsk_pool_setup(struct idpf_vport *vport, struct netdev_bpf *bpf)
 		return -EINVAL;
 	}
 
-	restart = idpf_xdp_enabled(vport) && netif_running(vport->netdev);
+	restart = idpf_xdp_enabled(vport) &&
+		  test_bit(IDPF_VPORT_UP, np->state);
 	if (!restart)
 		goto pool;
 

-- 
2.54.0.rc2.531.gaf818d63126a


^ permalink raw reply related

* [PATCH net v2 12/12] e1000e: Unroll PTP in probe error handling
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Matt Vollrath, Avigail Dahan
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Matt Vollrath <tactii@gmail.com>

If probe fails after registering the PTP clock and its delayed work,
these resources must be released.

This was not an issue until a 2016 fix moved the e1000e_ptp_init() call
before the jump to err_register.
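
The unwind ordering described above can be sketched in user space. This is
only an illustration under stated assumptions: struct demo_dev and the
demo_*() helpers are hypothetical stand-ins, not the e1000e code. It shows
why a resource initialised before the jump to err_register has to be
released on that label.

```c
#include <stdbool.h>

/* Hypothetical device state; each flag marks a resource currently held. */
struct demo_dev {
	bool hw_control;
	bool ptp_clock;
};

static void demo_ptp_init(struct demo_dev *d)   { d->ptp_clock = true; }
static void demo_ptp_remove(struct demo_dev *d) { d->ptp_clock = false; }

/*
 * Probe-style function: returns 0 on success, -1 if the (simulated)
 * registration step fails. On failure everything set up so far is undone.
 */
int demo_probe(struct demo_dev *d, bool register_fails)
{
	d->hw_control = true;

	/* PTP is initialised before the step that may jump to err_register. */
	demo_ptp_init(d);

	if (register_fails)
		goto err_register;

	return 0;

err_register:
	d->hw_control = false;
	demo_ptp_remove(d);	/* the kind of unwind step this patch adds */
	return -1;
}
```

Calling demo_probe() with register_fails set leaves both flags cleared;
without the demo_ptp_remove() call, ptp_clock would stay set after the
failed probe.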

Fixes: aa524b66c5ef ("e1000e: don't modify SYSTIM registers during SIOCSHWTSTAMP ioctl")
Signed-off-by: Matt Vollrath <tactii@gmail.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9befdacd6730..7ce0cc8ab8f4 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -7706,6 +7706,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 err_register:
 	if (!(adapter->flags & FLAG_HAS_AMT))
 		e1000e_release_hw_control(adapter);
+	e1000e_ptp_remove(adapter);
 err_eeprom:
 	if (hw->phy.ops.check_reset_block && !hw->phy.ops.check_reset_block(hw))
 		e1000_phy_hw_reset(&adapter->hw);

-- 
2.54.0.rc2.531.gaf818d63126a


^ permalink raw reply related

* [PATCH net v2 10/12] iavf: fix wrong VLAN mask for legacy Rx descriptors L2TAG2
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Petr Oros, Aleksandr Loktionov, Paul Menzel,
	Rafal Romanowski
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Petr Oros <poros@redhat.com>

The IAVF_RXD_LEGACY_L2TAG2_M mask was incorrectly defined as
GENMASK_ULL(63, 32), extracting 32 bits from qw2 instead of the
16-bit VLAN tag. In the legacy Rx descriptor layout, the 2nd L2TAG2
(VLAN tag) occupies bits 63:48 of qw2, not 63:32.

The oversized mask causes FIELD_GET to return a 32-bit value where the
actual VLAN tag sits in bits 31:16. When this value is passed to
iavf_receive_skb() as a u16 parameter, it gets truncated to the lower
16 bits (which contain the 1st L2TAG2, typically zero). As a result,
__vlan_hwaccel_put_tag() is never called and software VLAN interfaces
on VFs receive no traffic.
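
The truncation can be reproduced in user space. The sketch below assumes
simplified stand-ins for GENMASK_ULL() and FIELD_GET(); the demo_*() names
are hypothetical and only the two mask ranges come from the patch.

```c
#include <stdint.h>

/* Stand-in for GENMASK_ULL(hi, lo): bits hi..lo set, all others clear. */
static inline uint64_t demo_genmask_ull(unsigned int hi, unsigned int lo)
{
	return (~0ULL << lo) & (~0ULL >> (63 - hi));
}

/* Stand-in for FIELD_GET(): mask the value, then shift the field down. */
static inline uint64_t demo_field_get(uint64_t mask, unsigned int lo,
				      uint64_t val)
{
	return (val & mask) >> lo;
}

/* Old mask (63:32): the tag ends up in bits 31:16 and is cut off by u16. */
uint16_t demo_vlan_old_mask(uint64_t qw2)
{
	return (uint16_t)demo_field_get(demo_genmask_ull(63, 32), 32, qw2);
}

/* Fixed mask (63:48): the 16-bit tag is extracted directly. */
uint16_t demo_vlan_fixed_mask(uint64_t qw2)
{
	return (uint16_t)demo_field_get(demo_genmask_ull(63, 48), 48, qw2);
}
```

With VLAN 198 placed in bits 63:48 and bits 47:32 left at zero, i.e.
qw2 = (uint64_t)198 << 48, demo_vlan_old_mask() returns 0 while
demo_vlan_fixed_mask() returns 198.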

This affects VFs behind ice PF (VIRTCHNL VLAN v2) when the PF
advertises VLAN stripping into L2TAG2_2 and legacy descriptors are
used.

The flex descriptor path already uses the correct mask
(IAVF_RXD_FLEX_L2TAG2_2_M = GENMASK_ULL(63, 48)).

Reproducer:
 1. Create 2 VFs on ice PF (echo 2 > sriov_numvfs)
 2. Disable spoofchk on both VFs
 3. Move each VF into a separate network namespace
 4. On each VF: create VLAN interface (e.g. vlan 198), assign IP,
    bring up
 5. Set rx-vlan-offload OFF on both VFs
 6. Ping between VLAN interfaces -> expect PASS
    (VLAN tag stays in packet data, kernel matches in-band)
 7. Set rx-vlan-offload ON on both VFs
 8. Ping between VLAN interfaces -> expect FAIL if bug present
    (HW strips VLAN tag into descriptor L2TAG2 field, wrong mask
    extracts bits 47:32 instead of 63:48, truncated to u16 -> zero,
    __vlan_hwaccel_put_tag() never called, packet delivered to parent
    interface, not VLAN interface)

The reproducer requires legacy Rx descriptors. On modern ice + iavf
with full PTP support, flex descriptors are always negotiated and the
buggy legacy path is never reached. Flex descriptors require all of:
 - CONFIG_PTP_1588_CLOCK enabled
 - VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC granted by PF
 - PTP capabilities negotiated (VIRTCHNL_VF_CAP_PTP)
 - VIRTCHNL_1588_PTP_CAP_RX_TSTAMP supported
 - VIRTCHNL_RXDID_2_FLEX_SQ_NIC present in DDP profile

If any condition is not met, iavf_select_rx_desc_format() falls back
to legacy descriptors (RXDID=1) and the wrong L2TAG2 mask is hit.

Fixes: 2dc8e7c36d80 ("iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/iavf/iavf_type.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_type.h b/drivers/net/ethernet/intel/iavf/iavf_type.h
index 1d8cf29cb65a..5bb1de1cfd33 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_type.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_type.h
@@ -277,7 +277,7 @@ struct iavf_rx_desc {
 /* L2 Tag 2 Presence */
 #define IAVF_RXD_LEGACY_L2TAG2P_M		BIT(0)
 /* Stripped S-TAG VLAN from the receive packet */
-#define IAVF_RXD_LEGACY_L2TAG2_M		GENMASK_ULL(63, 32)
+#define IAVF_RXD_LEGACY_L2TAG2_M		GENMASK_ULL(63, 48)
 /* Stripped S-TAG VLAN from the receive packet */
 #define IAVF_RXD_FLEX_L2TAG2_2_M		GENMASK_ULL(63, 48)
 /* The packet is a UDP tunneled packet */

-- 
2.54.0.rc2.531.gaf818d63126a


^ permalink raw reply related

* [PATCH net v2 08/12] ice: fix potential NULL pointer deref in error path of ice_set_ringparam()
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Kohei Enju, Paul Greenwalt, Rinitha S
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Kohei Enju <kohei@enjuk.jp>

ice_set_ringparam() sets tstamp_ring of the temporary tx_rings to NULL
without clearing the ICE_TX_RING_FLAGS_TXTIME bit.
When ICE_TX_RING_FLAGS_TXTIME is set and the subsequent
ice_setup_tx_ring() call fails, a NULL pointer dereference could happen
in the unwinding sequence:

ice_clean_tx_ring()
-> ice_is_txtime_cfg() == true (ICE_TX_RING_FLAGS_TXTIME is set)
-> ice_free_tx_tstamp_ring()
  -> ice_free_tstamp_ring()
    -> tstamp_ring->desc (NULL deref)

Clear ICE_TX_RING_FLAGS_TXTIME bit to avoid the potential issue.
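
The invariant behind this can be sketched as follows. The demo_* types and
helpers are hypothetical simplifications, not the ice structures: a plain
bool stands in for the TXTIME bit, and the cleanup path is reduced to a
predicate saying whether it would be safe to run.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_tstamp_ring { void *desc; };

struct demo_tx_ring {
	bool txtime;				/* stands in for the TXTIME bit */
	struct demo_tstamp_ring *tstamp_ring;
};

/* Temporary copy with only the pointer reset (the pre-fix behaviour). */
void demo_copy_ptr_only(struct demo_tx_ring *tmp,
			const struct demo_tx_ring *src)
{
	*tmp = *src;
	tmp->tstamp_ring = NULL;
}

/* Temporary copy with pointer and flag reset together (the fix). */
void demo_copy_ptr_and_flag(struct demo_tx_ring *tmp,
			    const struct demo_tx_ring *src)
{
	*tmp = *src;
	tmp->tstamp_ring = NULL;
	tmp->txtime = false;
}

/*
 * A cleanup path that dereferences tstamp_ring whenever the flag is set
 * is only safe if "flag set" implies "pointer valid".
 */
bool demo_cleanup_is_safe(const struct demo_tx_ring *ring)
{
	return !ring->txtime || ring->tstamp_ring != NULL;
}
```

A copy made with demo_copy_ptr_only() from a ring that has the flag set
fails demo_cleanup_is_safe(); a copy made with demo_copy_ptr_and_flag()
passes it.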

Note that this potential issue was found by manual code review.
Compile tested only since unfortunately I don't have E830 devices.

Fixes: ccde82e90946 ("ice: add E830 Earliest TxTime First Offload support")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index e6a20af6f63d..f28416a707d7 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -3290,6 +3290,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
 		tx_rings[i].desc = NULL;
 		tx_rings[i].tx_buf = NULL;
 		tx_rings[i].tstamp_ring = NULL;
+		clear_bit(ICE_TX_RING_FLAGS_TXTIME, tx_rings[i].flags);
 		tx_rings[i].tx_tstamps = &pf->ptp.port.tx;
 		err = ice_setup_tx_ring(&tx_rings[i]);
 		if (err) {

-- 
2.54.0.rc2.531.gaf818d63126a


^ permalink raw reply related

* [PATCH net v2 09/12] i40e: don't advertise IFF_SUPP_NOFCS
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Kohei Enju, Aleksandr Loktionov,
	Sunitha Mekala
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Kohei Enju <kohei@enjuk.jp>

i40e advertises IFF_SUPP_NOFCS, allowing users to use the SO_NOFCS
socket option. However, this option is silently ignored, as the driver
does not check skb->no_fcs, and always enables FCS insertion offload.

Fix this by removing the advertisement of IFF_SUPP_NOFCS.

This behavior can be reproduced with a simple AF_PACKET socket:

  import socket
  s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
  s.setsockopt(socket.SOL_SOCKET, 43, 1) # SO_NOFCS
  s.bind(("eth0", 0))
  s.send(b'\xff' * 64)

Previously, send() succeeds but the driver ignores SO_NOFCS.
With this change, send() fails with -EPROTONOSUPPORT, as expected.

Fixes: 41c445ff0f48 ("i40e: main driver core")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 926d001b2150..028bd500603a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -13783,7 +13783,6 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 	netdev->neigh_priv_len = sizeof(u32) * 4;
 
 	netdev->priv_flags |= IFF_UNICAST_FLT;
-	netdev->priv_flags |= IFF_SUPP_NOFCS;
 	/* Setup netdev TC information */
 	i40e_vsi_config_netdev_tc(vsi, vsi->tc_config.enabled_tc);
 

-- 
2.54.0.rc2.531.gaf818d63126a


^ permalink raw reply related

* [PATCH net v2 07/12] ice: fix race condition in TX timestamp ring cleanup
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Keita Morisaki, Aleksandr Loktionov,
	Rinitha S
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Keita Morisaki <kmta1236@gmail.com>

Fix a race condition between ice_free_tx_tstamp_ring() and ice_tx_map()
that can cause a NULL pointer dereference.

ice_free_tx_tstamp_ring currently clears the ICE_TX_FLAGS_TXTIME flag
after NULLing the tstamp_ring. This could allow a concurrent ice_tx_map
call on another CPU to dereference the tstamp_ring, which could lead to
a NULL pointer dereference.

  CPU A:ice_free_tx_tstamp_ring() | CPU B:ice_tx_map()
  --------------------------------|---------------------------------
  tx_ring->tstamp_ring = NULL     |
                                  | ice_is_txtime_cfg() -> true
                                  | tstamp_ring = tx_ring->tstamp_ring
                                  | tstamp_ring->count  // NULL deref!
  flags &= ~ICE_TX_FLAGS_TXTIME   |

Fix by:
1. Reordering ice_free_tx_tstamp_ring() to clear the flag before
   NULLing the pointer, with smp_wmb() to ensure proper ordering.
2. Adding smp_rmb() in ice_tx_map() after the flag check to order the
   flag read before the pointer read, using READ_ONCE() for the
   pointer, and adding a NULL check as a safety net.
3. Converting tx_ring->flags from u8 to DECLARE_BITMAP() and using
   atomic bitops (set_bit(), clear_bit(), test_bit()) for all flag
   operations throughout the driver:
   - ICE_TX_RING_FLAGS_XDP
   - ICE_TX_RING_FLAGS_VLAN_L2TAG1
   - ICE_TX_RING_FLAGS_VLAN_L2TAG2
   - ICE_TX_RING_FLAGS_TXTIME
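
The ordering in steps 1 and 2 can be sketched with C11 atomics in user
space. This is an approximation under stated assumptions: the demo_* names
are hypothetical, the release/acquire fences are only rough analogues of
smp_wmb()/smp_rmb(), and the lifetime of the freed ring (kfree_rcu() in the
driver) is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_tstamp_ring { unsigned int count; };

struct demo_tx_ring {
	atomic_bool txtime;		/* stands in for the TXTIME bit */
	_Atomic(struct demo_tstamp_ring *) tstamp_ring;
};

/* Teardown side: clear the flag first, then unpublish the pointer. */
void demo_teardown(struct demo_tx_ring *ring)
{
	atomic_store_explicit(&ring->txtime, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_release); /* ~ smp_wmb() */
	atomic_store_explicit(&ring->tstamp_ring, NULL, memory_order_relaxed);
}

/*
 * Transmit side: returns the timestamp ring's count, or -1 when the
 * timestamp ring must not be used.
 */
long demo_tx_count(struct demo_tx_ring *ring)
{
	struct demo_tstamp_ring *ts;

	if (!atomic_load_explicit(&ring->txtime, memory_order_relaxed))
		return -1;

	atomic_thread_fence(memory_order_acquire); /* ~ smp_rmb() */
	ts = atomic_load_explicit(&ring->tstamp_ring, memory_order_relaxed);
	if (!ts)	/* flag was read before the clear, pointer after it */
		return -1;

	return (long)ts->count;
}
```

The NULL check is what covers the interleaving where the transmit side
reads the flag just before teardown clears it and reads the pointer just
after teardown resets it; in that case demo_tx_count() returns -1 instead
of dereferencing NULL.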

Fixes: ccde82e909467 ("ice: add E830 Earliest TxTime First Offload support")
Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/ice/ice.h         |  4 ++--
 drivers/net/ethernet/intel/ice/ice_txrx.h    | 16 ++++++++++------
 drivers/net/ethernet/intel/ice/ice_dcb_lib.c |  2 +-
 drivers/net/ethernet/intel/ice/ice_lib.c     |  4 ++--
 drivers/net/ethernet/intel/ice/ice_txrx.c    | 23 ++++++++++++++++-------
 5 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index eb3a48330cc1..725b130dd3a2 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -753,7 +753,7 @@ static inline bool ice_is_xdp_ena_vsi(struct ice_vsi *vsi)
 
 static inline void ice_set_ring_xdp(struct ice_tx_ring *ring)
 {
-	ring->flags |= ICE_TX_FLAGS_RING_XDP;
+	set_bit(ICE_TX_RING_FLAGS_XDP, ring->flags);
 }
 
 /**
@@ -778,7 +778,7 @@ static inline bool ice_is_txtime_ena(const struct ice_tx_ring *ring)
  */
 static inline bool ice_is_txtime_cfg(const struct ice_tx_ring *ring)
 {
-	return !!(ring->flags & ICE_TX_FLAGS_TXTIME);
+	return test_bit(ICE_TX_RING_FLAGS_TXTIME, ring->flags);
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index b6547e1b7c42..5e517f219379 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -212,6 +212,14 @@ enum ice_rx_dtype {
 	ICE_RX_DTYPE_SPLIT_ALWAYS	= 2,
 };
 
+enum ice_tx_ring_flags {
+	ICE_TX_RING_FLAGS_XDP,
+	ICE_TX_RING_FLAGS_VLAN_L2TAG1,
+	ICE_TX_RING_FLAGS_VLAN_L2TAG2,
+	ICE_TX_RING_FLAGS_TXTIME,
+	ICE_TX_RING_FLAGS_NBITS,
+};
+
 struct ice_pkt_ctx {
 	u64 cached_phctime;
 	__be16 vlan_proto;
@@ -352,11 +360,7 @@ struct ice_tx_ring {
 	u16 count;			/* Number of descriptors */
 	u16 q_index;			/* Queue number of ring */
 
-	u8 flags;
-#define ICE_TX_FLAGS_RING_XDP		BIT(0)
-#define ICE_TX_FLAGS_RING_VLAN_L2TAG1	BIT(1)
-#define ICE_TX_FLAGS_RING_VLAN_L2TAG2	BIT(2)
-#define ICE_TX_FLAGS_TXTIME		BIT(3)
+	DECLARE_BITMAP(flags, ICE_TX_RING_FLAGS_NBITS);
 
 	struct xsk_buff_pool *xsk_pool;
 
@@ -398,7 +402,7 @@ static inline bool ice_ring_ch_enabled(struct ice_tx_ring *ring)
 
 static inline bool ice_ring_is_xdp(struct ice_tx_ring *ring)
 {
-	return !!(ring->flags & ICE_TX_FLAGS_RING_XDP);
+	return test_bit(ICE_TX_RING_FLAGS_XDP, ring->flags);
 }
 
 enum ice_container_type {
diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
index bd77f1c001ee..16aa25535152 100644
--- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
@@ -943,7 +943,7 @@ ice_tx_prepare_vlan_flags_dcb(struct ice_tx_ring *tx_ring,
 		/* if this is not already set it means a VLAN 0 + priority needs
 		 * to be offloaded
 		 */
-		if (tx_ring->flags & ICE_TX_FLAGS_RING_VLAN_L2TAG2)
+		if (test_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG2, tx_ring->flags))
 			first->tx_flags |= ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN;
 		else
 			first->tx_flags |= ICE_TX_FLAGS_HW_VLAN;
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 689c6025ea82..837b71b7b2b7 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -1412,9 +1412,9 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
 		ring->count = vsi->num_tx_desc;
 		ring->txq_teid = ICE_INVAL_TEID;
 		if (dvm_ena)
-			ring->flags |= ICE_TX_FLAGS_RING_VLAN_L2TAG2;
+			set_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG2, ring->flags);
 		else
-			ring->flags |= ICE_TX_FLAGS_RING_VLAN_L2TAG1;
+			set_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG1, ring->flags);
 		WRITE_ONCE(vsi->tx_rings[i], ring);
 	}
 
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 7be9c062949b..4ca1a0602307 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -190,9 +190,10 @@ void ice_free_tstamp_ring(struct ice_tx_ring *tx_ring)
 void ice_free_tx_tstamp_ring(struct ice_tx_ring *tx_ring)
 {
 	ice_free_tstamp_ring(tx_ring);
+	clear_bit(ICE_TX_RING_FLAGS_TXTIME, tx_ring->flags);
+	smp_wmb();	/* order flag clear before pointer NULL */
 	kfree_rcu(tx_ring->tstamp_ring, rcu);
-	tx_ring->tstamp_ring = NULL;
-	tx_ring->flags &= ~ICE_TX_FLAGS_TXTIME;
+	WRITE_ONCE(tx_ring->tstamp_ring, NULL);
 }
 
 /**
@@ -405,7 +406,7 @@ static int ice_alloc_tstamp_ring(struct ice_tx_ring *tx_ring)
 	tx_ring->tstamp_ring = tstamp_ring;
 	tstamp_ring->desc = NULL;
 	tstamp_ring->count = ice_calc_ts_ring_count(tx_ring);
-	tx_ring->flags |= ICE_TX_FLAGS_TXTIME;
+	set_bit(ICE_TX_RING_FLAGS_TXTIME, tx_ring->flags);
 	return 0;
 }
 
@@ -1521,13 +1522,20 @@ ice_tx_map(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first,
 		return;
 
 	if (ice_is_txtime_cfg(tx_ring)) {
-		struct ice_tstamp_ring *tstamp_ring = tx_ring->tstamp_ring;
-		u32 tstamp_count = tstamp_ring->count;
-		u32 j = tstamp_ring->next_to_use;
+		struct ice_tstamp_ring *tstamp_ring;
+		u32 tstamp_count, j;
 		struct ice_ts_desc *ts_desc;
 		struct timespec64 ts;
 		u32 tstamp;
 
+		smp_rmb();	/* order flag read before pointer read */
+		tstamp_ring = READ_ONCE(tx_ring->tstamp_ring);
+		if (unlikely(!tstamp_ring))
+			goto ring_kick;
+
+		tstamp_count = tstamp_ring->count;
+		j = tstamp_ring->next_to_use;
+
 		ts = ktime_to_timespec64(first->skb->tstamp);
 		tstamp = ts.tv_nsec >> ICE_TXTIME_CTX_RESOLUTION_128NS;
 
@@ -1555,6 +1563,7 @@ ice_tx_map(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first,
 		tstamp_ring->next_to_use = j;
 		writel_relaxed(j, tstamp_ring->tail);
 	} else {
+ring_kick:
 		writel_relaxed(i, tx_ring->tail);
 	}
 	return;
@@ -1814,7 +1823,7 @@ ice_tx_prepare_vlan_flags(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first)
 	 */
 	if (skb_vlan_tag_present(skb)) {
 		first->vid = skb_vlan_tag_get(skb);
-		if (tx_ring->flags & ICE_TX_FLAGS_RING_VLAN_L2TAG2)
+		if (test_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG2, tx_ring->flags))
 			first->tx_flags |= ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN;
 		else
 			first->tx_flags |= ICE_TX_FLAGS_HW_VLAN;

-- 
2.54.0.rc2.531.gaf818d63126a


^ permalink raw reply related

* [PATCH net v2 05/12] ice: fix PHY config on media change with link-down-on-close
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Paul Greenwalt, Przemek Kitszel,
	Aleksandr Loktionov, Sunitha Mekala
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Paul Greenwalt <paul.greenwalt@intel.com>

Commit 1a3571b5938c ("ice: restore PHY settings on media insertion")
introduced separate flows for setting PHY configuration on media
present: ice_configure_phy() when link-down-on-close is disabled, and
ice_force_phys_link_state() when enabled. The latter incorrectly uses
the previous configuration even after module change, causing link
issues such as wrong speed or no link.

Unify PHY configuration into a single ice_phy_cfg() function with a
link_en parameter, ensuring PHY capabilities are always fetched fresh
from hardware.

Fixes: 1a3571b5938c ("ice: restore PHY settings on media insertion")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_main.c | 121 +++++++-----------------------
 1 file changed, 27 insertions(+), 94 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 3c36e3641b9e..ce3a0afe302d 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -1922,82 +1922,6 @@ static void ice_handle_mdd_event(struct ice_pf *pf)
 	ice_print_vfs_mdd_events(pf);
 }
 
-/**
- * ice_force_phys_link_state - Force the physical link state
- * @vsi: VSI to force the physical link state to up/down
- * @link_up: true/false indicates to set the physical link to up/down
- *
- * Force the physical link state by getting the current PHY capabilities from
- * hardware and setting the PHY config based on the determined capabilities. If
- * link changes a link event will be triggered because both the Enable Automatic
- * Link Update and LESM Enable bits are set when setting the PHY capabilities.
- *
- * Returns 0 on success, negative on failure
- */
-static int ice_force_phys_link_state(struct ice_vsi *vsi, bool link_up)
-{
-	struct ice_aqc_get_phy_caps_data *pcaps;
-	struct ice_aqc_set_phy_cfg_data *cfg;
-	struct ice_port_info *pi;
-	struct device *dev;
-	int retcode;
-
-	if (!vsi || !vsi->port_info || !vsi->back)
-		return -EINVAL;
-	if (vsi->type != ICE_VSI_PF)
-		return 0;
-
-	dev = ice_pf_to_dev(vsi->back);
-
-	pi = vsi->port_info;
-
-	pcaps = kzalloc_obj(*pcaps);
-	if (!pcaps)
-		return -ENOMEM;
-
-	retcode = ice_aq_get_phy_caps(pi, false, ICE_AQC_REPORT_ACTIVE_CFG, pcaps,
-				      NULL);
-	if (retcode) {
-		dev_err(dev, "Failed to get phy capabilities, VSI %d error %d\n",
-			vsi->vsi_num, retcode);
-		retcode = -EIO;
-		goto out;
-	}
-
-	/* No change in link */
-	if (link_up == !!(pcaps->caps & ICE_AQC_PHY_EN_LINK) &&
-	    link_up == !!(pi->phy.link_info.link_info & ICE_AQ_LINK_UP))
-		goto out;
-
-	/* Use the current user PHY configuration. The current user PHY
-	 * configuration is initialized during probe from PHY capabilities
-	 * software mode, and updated on set PHY configuration.
-	 */
-	cfg = kmemdup(&pi->phy.curr_user_phy_cfg, sizeof(*cfg), GFP_KERNEL);
-	if (!cfg) {
-		retcode = -ENOMEM;
-		goto out;
-	}
-
-	cfg->caps |= ICE_AQ_PHY_ENA_AUTO_LINK_UPDT;
-	if (link_up)
-		cfg->caps |= ICE_AQ_PHY_ENA_LINK;
-	else
-		cfg->caps &= ~ICE_AQ_PHY_ENA_LINK;
-
-	retcode = ice_aq_set_phy_cfg(&vsi->back->hw, pi, cfg, NULL);
-	if (retcode) {
-		dev_err(dev, "Failed to set phy config, VSI %d error %d\n",
-			vsi->vsi_num, retcode);
-		retcode = -EIO;
-	}
-
-	kfree(cfg);
-out:
-	kfree(pcaps);
-	return retcode;
-}
-
 /**
  * ice_init_nvm_phy_type - Initialize the NVM PHY type
  * @pi: port info structure
@@ -2066,7 +1990,7 @@ static void ice_init_link_dflt_override(struct ice_port_info *pi)
  * first time media is available. The ICE_LINK_DEFAULT_OVERRIDE_PENDING state
  * is used to indicate that the user PHY cfg default override is initialized
  * and the PHY has not been configured with the default override settings. The
- * state is set here, and cleared in ice_configure_phy the first time the PHY is
+ * state is set here, and cleared in ice_phy_cfg the first time the PHY is
  * configured.
  *
  * This function should be called only if the FW doesn't support default
@@ -2172,14 +2096,18 @@ static int ice_init_phy_user_cfg(struct ice_port_info *pi)
 }
 
 /**
- * ice_configure_phy - configure PHY
+ * ice_phy_cfg - configure PHY
  * @vsi: VSI of PHY
+ * @link_en: true/false indicates to set link to enable/disable
  *
  * Set the PHY configuration. If the current PHY configuration is the same as
- * the curr_user_phy_cfg, then do nothing to avoid link flap. Otherwise
- * configure the based get PHY capabilities for topology with media.
+ * the curr_user_phy_cfg and link_en hasn't changed, then do nothing to avoid
+ * link flap. Otherwise configure the PHY based get PHY capabilities for
+ * topology with media and link_en.
+ *
+ * Return: 0 on success, negative on failure
  */
-static int ice_configure_phy(struct ice_vsi *vsi)
+static int ice_phy_cfg(struct ice_vsi *vsi, bool link_en)
 {
 	struct device *dev = ice_pf_to_dev(vsi->back);
 	struct ice_port_info *pi = vsi->port_info;
@@ -2199,9 +2127,6 @@ static int ice_configure_phy(struct ice_vsi *vsi)
 	    phy->link_info.topo_media_conflict == ICE_AQ_LINK_TOPO_UNSUPP_MEDIA)
 		return -EPERM;
 
-	if (test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, pf->flags))
-		return ice_force_phys_link_state(vsi, true);
-
 	pcaps = kzalloc_obj(*pcaps);
 	if (!pcaps)
 		return -ENOMEM;
@@ -2215,10 +2140,8 @@ static int ice_configure_phy(struct ice_vsi *vsi)
 		goto done;
 	}
 
-	/* If PHY enable link is configured and configuration has not changed,
-	 * there's nothing to do
-	 */
-	if (pcaps->caps & ICE_AQC_PHY_EN_LINK &&
+	/* Configuration has not changed. There's nothing to do. */
+	if (link_en == !!(pcaps->caps & ICE_AQC_PHY_EN_LINK) &&
 	    ice_phy_caps_equals_cfg(pcaps, &phy->curr_user_phy_cfg))
 		goto done;
 
@@ -2282,8 +2205,12 @@ static int ice_configure_phy(struct ice_vsi *vsi)
 	 */
 	ice_cfg_phy_fc(pi, cfg, phy->curr_user_fc_req);
 
-	/* Enable link and link update */
-	cfg->caps |= ICE_AQ_PHY_ENA_AUTO_LINK_UPDT | ICE_AQ_PHY_ENA_LINK;
+	/* Enable/Disable link and link update */
+	cfg->caps |= ICE_AQ_PHY_ENA_AUTO_LINK_UPDT;
+	if (link_en)
+		cfg->caps |= ICE_AQ_PHY_ENA_LINK;
+	else
+		cfg->caps &= ~ICE_AQ_PHY_ENA_LINK;
 
 	err = ice_aq_set_phy_cfg(&pf->hw, pi, cfg, NULL);
 	if (err)
@@ -2336,7 +2263,7 @@ static void ice_check_media_subtask(struct ice_pf *pf)
 		    test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, vsi->back->flags))
 			return;
 
-		err = ice_configure_phy(vsi);
+		err = ice_phy_cfg(vsi, true);
 		if (!err)
 			clear_bit(ICE_FLAG_NO_MEDIA, pf->flags);
 
@@ -4892,9 +4819,15 @@ static int ice_init_link(struct ice_pf *pf)
 
 		if (!test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, pf->flags)) {
 			struct ice_vsi *vsi = ice_get_main_vsi(pf);
+			struct ice_link_default_override_tlv *ldo;
+			bool link_en;
+
+			ldo = &pf->link_dflt_override;
+			link_en = !(ldo->options &
+				    ICE_LINK_OVERRIDE_AUTO_LINK_DIS);
 
 			if (vsi)
-				ice_configure_phy(vsi);
+				ice_phy_cfg(vsi, link_en);
 		}
 	} else {
 		set_bit(ICE_FLAG_NO_MEDIA, pf->flags);
@@ -9707,7 +9640,7 @@ int ice_open_internal(struct net_device *netdev)
 			}
 		}
 
-		err = ice_configure_phy(vsi);
+		err = ice_phy_cfg(vsi, true);
 		if (err) {
 			netdev_err(netdev, "Failed to set physical link up, error %d\n",
 				   err);
@@ -9748,7 +9681,7 @@ int ice_stop(struct net_device *netdev)
 	}
 
 	if (test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, vsi->back->flags)) {
-		int link_err = ice_force_phys_link_state(vsi, false);
+		int link_err = ice_phy_cfg(vsi, false);
 
 		if (link_err) {
 			if (link_err == -ENOMEDIUM)

-- 
2.54.0.rc2.531.gaf818d63126a


^ permalink raw reply related

* [PATCH net v2 06/12] ice: fix ICE_AQ_LINK_SPEED_M for 200G
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Paul Greenwalt, Aleksandr Loktionov,
	Simon Horman, Sunitha Mekala
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Paul Greenwalt <paul.greenwalt@intel.com>

When setting PHY configuration during driver initialization, 200G link
speed is not being advertised even when the PHY is capable. This is
because the get PHY capabilities link speed response is being masked by
ICE_AQ_LINK_SPEED_M, which does not include the 200G link speed bit.

ICE_AQ_LINK_SPEED_200GB is defined as BIT(11), but the mask 0x7FF only
covers bits 0-10. Fix ICE_AQ_LINK_SPEED_M to use GENMASK(11, 0) so
that it covers all defined link speed bits including 200G.
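
The mask arithmetic above can be checked in userspace with a minimal
sketch. BIT() and GENMASK() below are simplified 32-bit stand-ins for
the kernel helpers, and LINK_SPEED_200GB, OLD_SPEED_M, NEW_SPEED_M and
mask_speeds() are illustrative names, not driver symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified 32-bit stand-ins for the kernel's BIT() and GENMASK(). */
#define BIT(n)		(1u << (n))
#define GENMASK(h, l)	(((~0u) >> (31 - (h))) & ((~0u) << (l)))

#define LINK_SPEED_200GB	BIT(11)		/* bit position quoted in the commit message */
#define OLD_SPEED_M		0x7FFu		/* covers bits 0-10 only */
#define NEW_SPEED_M		GENMASK(11, 0)	/* covers bits 0-11 */

/* Return the speed bits that survive masking (hypothetical helper). */
uint16_t mask_speeds(uint16_t reported, uint16_t mask)
{
	return reported & mask;
}
```

With the old mask, mask_speeds(LINK_SPEED_200GB, OLD_SPEED_M) drops the
200G bit; with the new mask the bit is preserved.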

Fixes: 24407a01e57c ("ice: Add 200G speed/phy type use")
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 859e9c66f3e7..3cbb1b0582e3 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1252,7 +1252,7 @@ struct ice_aqc_get_link_status_data {
 #define ICE_AQ_LINK_PWR_QSFP_CLASS_3	2
 #define ICE_AQ_LINK_PWR_QSFP_CLASS_4	3
 	__le16 link_speed;
-#define ICE_AQ_LINK_SPEED_M		0x7FF
+#define ICE_AQ_LINK_SPEED_M		GENMASK(11, 0)
 #define ICE_AQ_LINK_SPEED_10MB		BIT(0)
 #define ICE_AQ_LINK_SPEED_100MB		BIT(1)
 #define ICE_AQ_LINK_SPEED_1000MB	BIT(2)

-- 
2.54.0.rc2.531.gaf818d63126a


^ permalink raw reply related

* [PATCH net v2 04/12] ice: fix double-free of tx_buf skb
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Michal Schmidt
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Michal Schmidt <mschmidt@redhat.com>

If ice_tso() or ice_tx_csum() fail, the error path in
ice_xmit_frame_ring() frees the skb, but the 'first' tx_buf still points
to it and is marked as valid (ICE_TX_BUF_SKB).
'next_to_use' remains unchanged, so the potential problem will
likely fix itself when the next packet is transmitted and the tx_buf
gets overwritten. But if there is no next packet and the interface is
brought down instead, ice_clean_tx_ring() -> ice_unmap_and_free_tx_buf()
will find the tx_buf and free the skb for the second time.

The fix is to reset the tx_buf type to ICE_TX_BUF_EMPTY in the error
path, so that ice_unmap_and_free_tx_buf() does not free the skb a
second time.
Move the initialization of 'first' up, to ensure it's already valid in
case we hit the linearization error path.
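
The ownership problem described above can be modelled in userspace.
This is only a sketch: struct pkt, struct tx_slot, xmit_fail() and
clean_slot() are hypothetical names, and frees are counted instead of
performed so that the buggy case can be observed safely.

```c
#include <assert.h>

/* Hypothetical stand-in for an skb; counts how often it was "freed". */
struct pkt {
	int free_calls;
};

enum slot_type { SLOT_EMPTY, SLOT_SKB };

/* Hypothetical stand-in for a Tx buffer slot; not the ice structure. */
struct tx_slot {
	struct pkt *skb;
	enum slot_type type;
};

void pkt_free(struct pkt *p)
{
	p->free_calls++;
}

/* Transmit path model: claim the slot, then fail and drop the packet. */
void xmit_fail(struct tx_slot *first, struct pkt *skb, int apply_fix)
{
	first->skb = skb;
	first->type = SLOT_SKB;

	/* an offload setup step fails here, so the packet is dropped */
	pkt_free(skb);
	if (apply_fix)
		first->type = SLOT_EMPTY;	/* slot no longer owns the packet */
}

/* Ring cleanup model: free whatever the slot still claims to own. */
void clean_slot(struct tx_slot *slot)
{
	if (slot->type == SLOT_SKB)
		pkt_free(slot->skb);
	slot->type = SLOT_EMPTY;
	slot->skb = 0;
}
```

Calling xmit_fail() followed by clean_slot() counts two frees without
the reset and one free with it.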

The bug was spotted by AI while I had it looking for something else.
It also proposed an initial version of the patch.

I reproduced the bug and tested the fix by adding code to inject
failures, on a build with KASAN.

I looked for similar bugs in related Intel drivers and did not find any.

Fixes: d76a60ba7afb ("ice: Add support for VLANs and offloads")
Assisted-by: Claude:claude-4.6-opus-high Cursor
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index a2cd4cf37734..7be9c062949b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -2158,6 +2158,9 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring)

 	ice_trace(xmit_frame_ring, tx_ring, skb);

+	/* record the location of the first descriptor for this packet */
+	first = &tx_ring->tx_buf[tx_ring->next_to_use];
+
 	count = ice_xmit_desc_count(skb);
 	if (ice_chk_linearize(skb, count)) {
 		if (__skb_linearize(skb))
@@ -2183,8 +2186,6 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring)

 	offload.tx_ring = tx_ring;

-	/* record the location of the first descriptor for this packet */
-	first = &tx_ring->tx_buf[tx_ring->next_to_use];
 	first->skb = skb;
 	first->type = ICE_TX_BUF_SKB;
 	first->bytecount = max_t(unsigned int, skb->len, ETH_ZLEN);
@@ -2249,6 +2250,7 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring)
 out_drop:
 	ice_trace(xmit_frame_ring_drop, tx_ring, skb);
 	dev_kfree_skb_any(skb);
+	first->type = ICE_TX_BUF_EMPTY;
 	return NETDEV_TX_OK;
 }

-- 
2.54.0.rc2.531.gaf818d63126a

^ permalink raw reply related

* [PATCH net v2 03/12] ice: fix double free in ice_sf_eth_activate() error path
From: Jacob Keller @ 2026-04-17  0:53 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Jacob Keller, Guangshuo Li, stable, Aleksandr Loktionov,
	Simon Horman
In-Reply-To: <20260416-iwl-net-submission-2026-04-14-v2-0-686c33c9828d@intel.com>

From: Guangshuo Li <lgs201920130244@gmail.com>

When auxiliary_device_add() fails, ice_sf_eth_activate() jumps to
aux_dev_uninit and calls auxiliary_device_uninit(&sf_dev->adev).

The device release callback ice_sf_dev_release() frees sf_dev, but
the current error path falls through to sf_dev_free and calls
kfree(sf_dev) again, causing a double free.

Keep kfree(sf_dev) for the auxiliary_device_init() failure path, but
avoid falling through to sf_dev_free after auxiliary_device_uninit().
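
The fall-through can be illustrated with a small userspace model of the
error labels. This is a sketch under assumptions: struct sf_model,
model_release(), model_uninit() and model_activate() are hypothetical
names, the error values are arbitrary, and frees are counted rather
than performed.

```c
#include <assert.h>

/* Hypothetical stand-in for sf_dev; counts how often it was "freed". */
struct sf_model {
	int free_calls;
};

/* Models the device release callback, which frees the object. */
void model_release(struct sf_model *d)
{
	d->free_calls++;
}

/* Models the uninit step dropping the last reference. */
void model_uninit(struct sf_model *d)
{
	model_release(d);
}

/* Models the activate error paths, with and without the early return. */
int model_activate(struct sf_model *d, int init_fails, int add_fails,
		   int apply_fix)
{
	int err;

	if (init_fails) {
		err = -1;		/* arbitrary error value */
		goto sf_dev_free;
	}
	if (add_fails) {
		err = -2;		/* arbitrary error value */
		goto aux_dev_uninit;
	}
	return 0;

aux_dev_uninit:
	model_uninit(d);
	if (apply_fix)
		return err;		/* release callback already freed d */
sf_dev_free:
	d->free_calls++;		/* models the explicit kfree() */
	return err;
}
```

When the add step fails, the unfixed path counts two frees and the
fixed path counts one; the init failure path still frees exactly once.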

Fixes: 13acc5c4cdbe ("ice: subfunction activation and base devlink ops")
Cc: stable@vger.kernel.org
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_sf_eth.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_sf_eth.c b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
index 2cf04bc6edce..a730aa368c92 100644
--- a/drivers/net/ethernet/intel/ice/ice_sf_eth.c
+++ b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
@@ -305,6 +305,8 @@ ice_sf_eth_activate(struct ice_dynamic_port *dyn_port,
 
 aux_dev_uninit:
 	auxiliary_device_uninit(&sf_dev->adev);
+	return err;
+
 sf_dev_free:
 	kfree(sf_dev);
 xa_erase:

-- 
2.54.0.rc2.531.gaf818d63126a


^ permalink raw reply related
