Netdev List
 help / color / mirror / Atom feed
* [PATCH 6.1.y] Revert "wifi: cfg80211: stop NAN and P2P in cfg80211_leave"
From: guocai.he.cn @ 2026-04-14  2:16 UTC (permalink / raw)
  To: stable; +Cc: gregkh, johannes.berg, netdev, regressions,
	miriam.rachel.korenblit

From: Guocai He <guocai.he.cn@windriver.com>

This reverts commit 0c4f1c02d27a880b10b58c63f574f13bed4f711d which is commit 
e1696c8bd0056bc1a5f7766f58ac333adc203e8a upstream.

The reverted patch introduced a deadlock. The locking situation in mainline is 
totally different, so it is incorrect to directly backport the commit from mainline.

Signed-off-by: Guocai He <guocai.he.cn@windriver.com>
---
 net/wireless/core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/wireless/core.c b/net/wireless/core.c
index e75326932c32..2a6a8bdfa724 100644
--- a/net/wireless/core.c
+++ b/net/wireless/core.c
@@ -1328,10 +1328,8 @@ void __cfg80211_leave(struct cfg80211_registered_device *rdev,
 		__cfg80211_leave_ocb(rdev, dev);
 		break;
 	case NL80211_IFTYPE_P2P_DEVICE:
-		cfg80211_stop_p2p_device(rdev, wdev);
-		break;
 	case NL80211_IFTYPE_NAN:
-		cfg80211_stop_nan(rdev, wdev);
+		/* cannot happen, has no netdev */
 		break;
 	case NL80211_IFTYPE_AP_VLAN:
 	case NL80211_IFTYPE_MONITOR:
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v5 1/2] flow_dissector: do not dissect PPPoE PFC frames
From: Qingfang Deng @ 2026-04-14  2:13 UTC (permalink / raw)
  To: linux-ppp, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Qingfang Deng, Tony Nguyen,
	Guillaume Nault, Wojciech Drewek, netdev, linux-kernel
  Cc: Paul Mackerras, Jaco Kroon, James Carlson, Marcin Szycik

RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the flow dissector driver has assumed an
uncompressed frame until the blamed commit.

During the review process of that commit [1], support for PFC is
suggested. However, having a compressed (1-byte) protocol field means
the subsequent PPP payload is shifted by one byte, causing 4-byte
misalignment for the network header and an unaligned access exception
on some architectures.

The exception can be reproduced by sending a PPPoE PFC frame to an
ethernet interface of a MIPS board, with RPS enabled, even if no PPPoE
session is active on that interface:

$ 0   : 00000000 80c40000 00000000 85144817
$ 4   : 00000008 00000100 80a75758 81dc9bb8
$ 8   : 00000010 8087ae2c 0000003d 00000000
$12   : 000000e0 00000039 00000000 00000000
$16   : 85043240 80a75758 81dc9bb8 00006488
$20   : 0000002f 00000007 85144810 80a70000
$24   : 81d1bda0 00000000
$28   : 81dc8000 81dc9aa8 00000000 805ead08
Hi    : 00009d51
Lo    : 2163358a
epc   : 805e91f0 __skb_flow_dissect+0x1b0/0x1b50
ra    : 805ead08 __skb_get_hash_net+0x74/0x12c
Status: 11000403        KERNEL EXL IE
Cause : 40800010 (ExcCode 04)
BadVA : 85144817
PrId  : 0001992f (MIPS 1004Kc)
Call Trace:
[<805e91f0>] __skb_flow_dissect+0x1b0/0x1b50
[<805ead08>] __skb_get_hash_net+0x74/0x12c
[<805ef330>] get_rps_cpu+0x1b8/0x3fc
[<805fca70>] netif_receive_skb_list_internal+0x324/0x364
[<805fd120>] napi_complete_done+0x68/0x2a4
[<8058de5c>] mtk_napi_rx+0x228/0xfec
[<805fd398>] __napi_poll+0x3c/0x1c4
[<805fd754>] napi_threaded_poll_loop+0x234/0x29c
[<805fd848>] napi_threaded_poll+0x8c/0xb0
[<80053544>] kthread+0x104/0x12c
[<80002bd8>] ret_from_kernel_thread+0x14/0x1c

Code: 02d51821  1060045b  00000000 <8c640000> 3084000f  2c820005  144001a2  00042080  8e220000

To reduce the attack surface and maintain performance, do not process
PPPoE PFC frames.

[1] https://patch.msgid.link/20220630231016.GA392@debian.home
Fixes: 46126db9c861 ("flow_dissector: Add PPPoE dissectors")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
---
Changes in v5: drop byte-swap change
 Link to v4: https://lore.kernel.org/netdev/20260410033627.93786-1-qingfang.deng@linux.dev/

 net/core/flow_dissector.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 1b61bb25ba0e..f9aaba554128 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1374,16 +1374,8 @@ bool __skb_flow_dissect(const struct net *net,
 			break;
 		}
 
-		/* least significant bit of the most significant octet
-		 * indicates if protocol field was compressed
-		 */
 		ppp_proto = ntohs(hdr->proto);
-		if (ppp_proto & 0x0100) {
-			ppp_proto = ppp_proto >> 8;
-			nhoff += PPPOE_SES_HLEN - 1;
-		} else {
-			nhoff += PPPOE_SES_HLEN;
-		}
+		nhoff += PPPOE_SES_HLEN;
 
 		if (ppp_proto == PPP_IP) {
 			proto = htons(ETH_P_IP);
-- 
2.43.0


^ permalink raw reply related

* [PATCH net v5 2/2] pppoe: drop PFC frames
From: Qingfang Deng @ 2026-04-14  2:13 UTC (permalink / raw)
  To: linux-ppp, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Qingfang Deng, Breno Leitao,
	Sebastian Andrzej Siewior, Kees Cook, Kuniyuki Iwashima,
	Guillaume Nault, Eric Woudstra, Simon Horman, Sam Protsenko,
	netdev, linux-kernel
  Cc: Paul Mackerras, Jaco Kroon, James Carlson, Wojciech Drewek,
	Marcin Szycik
In-Reply-To: <20260414021353.23471-1-qingfang.deng@linux.dev>

RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the current PPPoE driver assumes an
uncompressed (2-byte) protocol field. However, the generic PPP layer
function ppp_input() is not aware of the negotiation result, and still
accepts PFC frames.

If a peer with a broken implementation or an attacker sends a frame with
a compressed (1-byte) protocol field, the subsequent PPP payload is
shifted by one byte. This causes the network header to be 4-byte
misaligned, which may trigger unaligned access exceptions on some
architectures.

To reduce the attack surface, drop PPPoE PFC frames. Introduce
ppp_skb_is_compressed_proto() helper function to be used in both
ppp_generic.c and pppoe.c to avoid open-coding.

Fixes: 7fb1b8ca8fa1 ("ppp: Move PFC decompression to PPP generic layer")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
---
Changes in v5: add Reviewed-by tag from Simon
 Link to v4: https://lore.kernel.org/netdev/20260410033627.93786-2-qingfang.deng@linux.dev/

 drivers/net/ppp/ppp_generic.c |  2 +-
 drivers/net/ppp/pppoe.c       |  8 +++++++-
 include/linux/ppp_defs.h      | 16 ++++++++++++++++
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index b097d1b38ac9..853da966ad46 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -2242,7 +2242,7 @@ ppp_do_recv(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
  */
 static void __ppp_decompress_proto(struct sk_buff *skb)
 {
-	if (skb->data[0] & 0x01)
+	if (ppp_skb_is_compressed_proto(skb))
 		*(u8 *)skb_push(skb, 1) = 0x00;
 }
 
diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index d546a7af0d54..bdd61c504a1c 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -393,7 +393,7 @@ static int pppoe_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (skb_mac_header_len(skb) < ETH_HLEN)
 		goto drop;
 
-	if (!pskb_may_pull(skb, sizeof(struct pppoe_hdr)))
+	if (!pskb_may_pull(skb, PPPOE_SES_HLEN))
 		goto drop;
 
 	ph = pppoe_hdr(skb);
@@ -403,6 +403,12 @@ static int pppoe_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (skb->len < len)
 		goto drop;
 
+	/* skb->data points to the PPP protocol header after skb_pull_rcsum.
+	 * Drop PFC frames.
+	 */
+	if (ppp_skb_is_compressed_proto(skb))
+		goto drop;
+
 	if (pskb_trim_rcsum(skb, len))
 		goto drop;
 
diff --git a/include/linux/ppp_defs.h b/include/linux/ppp_defs.h
index b7e57fdbd413..b1d1f46d7d3b 100644
--- a/include/linux/ppp_defs.h
+++ b/include/linux/ppp_defs.h
@@ -8,6 +8,7 @@
 #define _PPP_DEFS_H_
 
 #include <linux/crc-ccitt.h>
+#include <linux/skbuff.h>
 #include <uapi/linux/ppp_defs.h>
 
 #define PPP_FCS(fcs, c) crc_ccitt_byte(fcs, c)
@@ -25,4 +26,19 @@ static inline bool ppp_proto_is_valid(u16 proto)
 	return !!((proto & 0x0101) == 0x0001);
 }
 
+/**
+ * ppp_skb_is_compressed_proto - checks if PPP protocol in a skb is compressed
+ * @skb: skb to check
+ *
+ * Check if the PPP protocol field is compressed (the least significant
+ * bit of the most significant octet is 1). skb->data must point to the PPP
+ * protocol header.
+ *
+ * Return: Whether the PPP protocol field is compressed.
+ */
+static inline bool ppp_skb_is_compressed_proto(const struct sk_buff *skb)
+{
+	return unlikely(skb->data[0] & 0x01);
+}
+
 #endif /* _PPP_DEFS_H_ */
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH] xfrm: fix memory leak in xfrm_add_policy()
From: Deepanshu Kartikey @ 2026-04-14  2:12 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: steffen.klassert, herbert, davem, edumazet, kuba, pabeni, horms,
	leon, netdev, linux-kernel, syzbot+901d48e0b95aed4a2548
In-Reply-To: <adz_CeDItDjznfWo@krikkit>

On Mon, Apr 13, 2026 at 8:04 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
>
> Ok. Then you should wait 2 weeks until the merge window is over:
> https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/
>
> and use "[PATCH ipsec-next]" as prefix for the cleanup patch (+ drop
> the syzbot references).
>

Hi Sabrina,

Thanks for the guidance.  I have submitted patch v3.

Thanks

Deepanshu

^ permalink raw reply

* Re: [PATCH net-next v2 0/3] Follow-ups to nk_qlease net selftests
From: Jakub Kicinski @ 2026-04-14  2:12 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, dw, pabeni, razor
In-Reply-To: <20260413220809.604592-1-daniel@iogearbox.net>

On Tue, 14 Apr 2026 00:08:03 +0200 Daniel Borkmann wrote:
> This is a set of follow-ups addressing [0]:
> 
> - Split netdevsim tests from HW tests in nk_qlease and move the SW
>   tests under selftests/net/
> - Remove multiple ksft_run()s to fix the recently enforced hard-fail
> - Move all the setup inside the test cases for the ones under
>   selftests/net/ (I'll defer the HW ones to David)
> - Add more test coverage related to queue leasing behavior and corner
>   cases, so now we have 45 tests in nk_qlease.py with netdevsim
>   which does not need special HW

LGTM, thanks!

I'll let it run overnight in the CI to shake out any latent flakiness
(and the crash which I think is from Stan's series).

Could you cook up one more follow up to enable VETH in the config?
We're getting:

# # Exception| Traceback (most recent call last):
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/ksft.py", line 420, in ksft_run
# # Exception|     func(*args)
# # Exception|     ~~~~^^^^^^^
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/drivers/net/hw/./nk_qlease.py", line 393, in test_veth_queue_create
# # Exception|     ip("link add veth0 type veth peer name veth1")
# # Exception|     ~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/utils.py", line 238, in ip
# # Exception|     return tool('ip', args, json=json, host=host)
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/utils.py", line 225, in tool
# # Exception|     cmd_obj = cmd(cmd_str, ns=ns, host=host)
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/utils.py", line 91, in __init__
# # Exception|     self.process(terminate=False, fail=fail, timeout=timeout)
# # Exception|     ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # Exception|   File "/srv/vmksft/testing/wt-24/tools/testing/selftests/net/lib/py/utils.py", line 117, in process
# # Exception|     raise CmdExitFailure("Command failed", self)
# # Exception| net.lib.py.utils.CmdExitFailure: Command failed
# # Exception| CMD: ip link add veth0 type veth peer name veth1
# # Exception|   EXIT: 2
# # Exception|   STDERR: Error: Unknown device type.
# # Exception| 
# not ok 27 nk_qlease.test_veth_queue_create


I guess you can post it without waiting for this to be merged, it won't
conflict.

^ permalink raw reply

* Re: [PATCH net-next v2] net/smc: cap allocation order for SMC-R physically contiguous buffers
From: D. Wythe @ 2026-04-14  2:10 UTC (permalink / raw)
  To: Simon Horman
  Cc: D. Wythe, David S. Miller, Dust Li, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi,
	Tony Lu, Wen Gu, linux-kernel, linux-rdma, linux-s390, netdev,
	oliver.yang, pasic
In-Reply-To: <20260410151631.GY469338@kernel.org>

On Fri, Apr 10, 2026 at 04:16:31PM +0100, Simon Horman wrote:
> On Tue, Apr 07, 2026 at 08:43:37PM +0800, D. Wythe wrote:
> > The alloc_pages() cannot satisfy requests exceeding MAX_PAGE_ORDER,
> > and attempting such allocations will lead to guaranteed failures
> > and potential kernel warnings.
> > 
> > For SMCR_PHYS_CONT_BUFS, cap the allocation order to MAX_PAGE_ORDER.
> > This ensures the attempts to allocate the largest possible physically
> > contiguous chunk succeed, instead of failing with an invalid order.
> > This also avoids redundant "try-fail-degrade" cycles in
> > __smc_buf_create().
> > 
> > For SMCR_MIXED_BUFS, no cap is needed: if the order exceeds
> > MAX_PAGE_ORDER, alloc_pages() will silently fail (__GFP_NOWARN)
> > and automatically fall back to virtual memory.
> > 
> > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > ---
> > Changes v1 -> v2:
> > https://lore.kernel.org/netdev/20260312082154.36971-1-alibuda@linux.alibaba.com/
> > 
> > - Move the bufsize cap from smcr_new_buf_create() up to
> >   __smc_buf_create(), which is simpler and avoids touching
> >   the allocation logic itself.
> 
> The nit below notwithstanding, this looks good to me.
> 
> Reviewed-by: Simon Horman <horms@kernel.org>
> 
> > ---
> >  net/smc/smc_core.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> > index e2d083daeb7e..cdd881746e21 100644
> > --- a/net/smc/smc_core.c
> > +++ b/net/smc/smc_core.c
> > @@ -2440,6 +2440,10 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
> >  		/* use socket send buffer size (w/o overhead) as start value */
> >  		bufsize = smc->sk.sk_sndbuf / 2;
> >  
> > +	/* limit bufsize for physically contiguous buffers */
> > +	if (!is_smcd && lgr->buf_type == SMCR_PHYS_CONT_BUFS)
> > +		bufsize = min_t(int, bufsize, (PAGE_SIZE << MAX_PAGE_ORDER));
> 
> nit: I think min() is sufficient here, and the inner parentheses are
>      unnecessary

Hi Simon,

I think min_t is required here because min() triggers a signedness
error:

././include/linux/compiler_types.h:706:38: error: call to
‘__compiletime_assert_950’ declared with attribute error: min(bufsize,
((1UL) << 12) << 10) signedness error

The inner parentheses can be removed, though.

D. Wythe

> 
> > +
> >  	for (bufsize_comp = smc_compress_bufsize(bufsize, is_smcd, is_rmb);
> >  	     bufsize_comp >= 0; bufsize_comp--) {
> >  		if (is_rmb) {
> > -- 
> > 2.45.0
> > 

^ permalink raw reply

* [PATCH ipsec-next v3] xfrm: cleanup error path in xfrm_add_policy()
From: Deepanshu Kartikey @ 2026-04-14  2:09 UTC (permalink / raw)
  To: steffen.klassert, herbert, davem, edumazet, kuba, pabeni, horms,
	sd
  Cc: netdev, linux-kernel, Deepanshu Kartikey

Replace the open-coded manual cleanup in the error path of
xfrm_add_policy() with xfrm_policy_destroy(), which already
handles all the necessary cleanup internally. This is consistent
with how xfrm_policy_construct() handles its own error paths.

The walk.dead flag must be set before calling xfrm_policy_destroy()
as required by BUG_ON(!policy->walk.dead).

Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
---
v3:
  - Changed prefix to ipsec-next as this is a cleanup
  - Dropped syzbot references as suggested by Sabrina Dubroca
v2:
  - Reworded commit message to reflect cleanup rather than bugfix
    as suggested by Sabrina Dubroca
  - Removed incorrect Fixes: and Closes: tags
  - Corrected subject prefix to PATCH ipsec
---
 net/xfrm/xfrm_user.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index d56450f61669..ae144d1e4a65 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2267,9 +2267,8 @@ static int xfrm_add_policy(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	if (err) {
 		xfrm_dev_policy_delete(xp);
-		xfrm_dev_policy_free(xp);
-		security_xfrm_policy_free(xp->security);
-		kfree(xp);
+		xp->walk.dead = 1;
+		xfrm_policy_destroy(xp);
 		return err;
 	}
 
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next v7 00/15] net: sleepable ndo_set_rx_mode
From: Jakub Kicinski @ 2026-04-14  2:08 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: netdev, davem, edumazet, pabeni
In-Reply-To: <20260413171131.550126-1-sdf@fomichev.me>

On Mon, 13 Apr 2026 10:11:16 -0700 Stanislav Fomichev wrote:
> This series adds a new ndo_set_rx_mode_async callback that enables
> drivers to handle address list updates in a sleepable context. The
> current ndo_set_rx_mode is called under the netif_addr_lock spinlock
> with BHs disabled, which prevents drivers from sleeping. This is
> problematic for ops-locked drivers that need to sleep.

Hi hi, hit this on the new(ish) queue leasing test with bnxt (debug
kernel):

| [ 1148.733157] kselftest: Running tests in drivers/net/hw
[ 1151.485032] ref_tracker: reference already released.
[ 1151.491522] ref_tracker: allocated in:
[ 1151.496526]  __dev_set_rx_mode+0x398/0x4f0
[ 1151.501923]  dev_mc_add+0xe7/0x100
[ 1151.506537]  igmp6_group_added+0x31c/0x400
[ 1151.511930]  __ipv6_dev_mc_inc+0x282/0x590
[ 1151.517320]  __ipv6_sock_mc_join+0x40d/0x7c0
[ 1151.522908]  do_ipv6_setsockopt+0x3504/0x3700
[ 1151.528594]  ipv6_setsockopt+0x7e/0xf0
[ 1151.533596]  do_sock_setsockopt+0x164/0x3b0
[ 1151.539089]  __sys_setsockopt+0xe4/0x150
[ 1151.544276]  __x64_sys_setsockopt+0xbd/0x180
[ 1151.549866]  do_syscall_64+0xf3/0x5e0
[ 1151.554763]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1151.561229] ref_tracker: freed in:
[ 1151.565840]  netdev_rx_mode_work+0x205/0x410
[ 1151.571429]  process_one_work+0xdf5/0x1410
[ 1151.576824]  worker_thread+0x4f1/0xd60
[ 1151.581827]  kthread+0x364/0x460
[ 1151.586246]  ret_from_fork+0x4a4/0x720
[ 1151.591251]  ret_from_fork_asm+0x11/0x20
[ 1151.596704] ------------[ cut here ]------------
[ 1151.602747] WARNING: lib/ref_tracker.c:322 at ref_tracker_free.cold+0x10a/0x1f9, CPU#7: kworker/7:2/5262
[ 1151.614292] Modules linked in:
[ 1151.618626] CPU: 7 UID: 0 PID: 5262 Comm: kworker/7:2 Not tainted 7.0.0-rc7-hmtc-g3cb8f09d448a #1 PREEMPT(full) 
[ 1151.630928] Hardware name: Giga Computing E163-Z34-AAH1-000/MZ33-DC1-000, BIOS R30_F44 12/24/2025
[ 1151.641762] Workqueue: events netdev_rx_mode_work
[ 1151.647920] RIP: 0010:ref_tracker_free.cold+0x10a/0x1f9
[ 1151.654646] Code: e0 2a 0f b6 04 02 84 c0 74 04 3c 03 7e 78 8b 7b 18 4c 89 04 24 e8 71 b2 e4 01 4c 8b 04 24 4c 89 c6 48 89 ef e8 32 92 05 04 90 <0f> 0b 90 ba ea ff ff ff e9 06 de e4 01 4c 89 ea 48 89 df 4c 89 04
[ 1151.676612] RSP: 0018:ffa0000035f6fb18 EFLAGS: 00010296
[ 1151.683343] RAX: 0000000000000000 RBX: ff110001f7ae9460 RCX: 0000000000000001
[ 1151.692224] RDX: 00000000ffffffff RSI: 1ffffffff1841334 RDI: 0000000000000001
[ 1151.701102] RBP: ff110001a3c12618 R08: 0000000000000000 R09: 0000000000000000
[ 1151.709973] R10: 0000000000000007 R11: 0000000000000001 R12: 1ff4000006bedf67
[ 1151.718835] R13: ff110001f7ae9478 R14: ff110001fd1949fc R15: ffffffff8a80b9a0
[ 1151.727814] FS:  0000000000000000(0000) GS:ff11001882899000(0000) knlGS:0000000000000000
[ 1151.737696] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1151.744947] CR2: 00007f83287c3574 CR3: 00000017e26b0002 CR4: 0000000000771ef0
[ 1151.753758] PKRU: 55555554
[ 1151.757588] Call Trace:
[ 1151.761130]  <TASK>
[ 1151.764282]  ? ref_tracker_alloc+0x460/0x460
[ 1151.769872]  ? netdev_rx_mode_work+0x205/0x410
[ 1151.775659]  ? process_one_work+0xdf5/0x1410
[ 1151.781249]  ? worker_thread+0x4f1/0xd60
[ 1151.786450]  ? kthread+0x364/0x460
[ 1151.791066]  ? ret_from_fork+0x4a4/0x720
[ 1151.796267]  ? ret_from_fork_asm+0x11/0x20
[ 1151.801666]  ? mark_held_locks+0x40/0x70
[ 1151.806870]  netdev_rx_mode_work+0x205/0x410
[ 1151.812461]  ? trace_workqueue_execute_start+0x9b/0x180
[ 1151.819124]  ? process_one_work+0xdb4/0x1410
[ 1151.824712]  process_one_work+0xdf5/0x1410
[ 1151.830111]  ? pwq_dec_nr_in_flight+0x710/0x710
[ 1151.835992]  ? lock_acquire.part.0+0xbc/0x260
[ 1151.841686]  worker_thread+0x4f1/0xd60
[ 1151.846694]  ? rescuer_thread+0x1320/0x1320
[ 1151.852188]  kthread+0x364/0x460
[ 1151.856609]  ? trace_irq_enable.constprop.0+0x9b/0x180
[ 1151.863180]  ? kthread_affine_node+0x330/0x330
[ 1151.868965]  ret_from_fork+0x4a4/0x720
[ 1151.873965]  ? arch_exit_to_user_mode_prepare.isra.0+0xb0/0xb0
[ 1151.881315]  ? __switch_to+0x540/0xd00
[ 1151.886320]  ? kthread_affine_node+0x330/0x330
[ 1151.892108]  ret_from_fork_asm+0x11/0x20
| [ 1151.905076] hardirqs last  enabled at (7665): [<ffffffff847c61ea>] __up_console_sem+0x5a/0x70
| [ 1151.915450] hardirqs last disabled at (7676): [<ffffffff847c61cf>] __up_console_sem+0x3f/0x70
| [ 1151.925823] softirqs last  enabled at (7600): [<ffffffff8462d50e>] handle_softirqs+0x60e/0x920
| [ 1151.936291] softirqs last disabled at (7595): [<ffffffff8462dca2>] irq_exit_rcu+0xa2/0xf0
| [ 1151.946275] ---[ end trace 0000000000000000 ]---
[ 1152.247817] ref_tracker: netdev@ff110001a3c12618 has 1/1 users at\x0a     __dev_set_rx_mode+0x398/0x4f0\x0a     dev_mc_add+0xe7/0x100\x0a     igmp6_group_added+0x31c/0x400\x0a     __ipv6_dev_mc_inc+0x282/0x590\x0a     addrconf_dad_begin+0x13c/0x540\x0a     addrconf_dad_work+0x170/0x930\x0a     process_one_work+0xdf5/0x1410\x0a     worker_thread+0x4f1/0xd60\x0a     kthread+0x364/0x460\x0a     ret_from_fork+0x4a4/0x720\x0a     ret_from_fork_asm+0x11/0x20\x0a
[ 1152.318767] ------------[ cut here ]------------
[ 1152.325512] WARNING: lib/ref_tracker.c:246 at ref_tracker_dir_exit+0x466/0x7e0, CPU#27: ip/15056
[ 1152.336246] Modules linked in:
[ 1152.340578] CPU: 27 UID: 0 PID: 15056 Comm: ip Tainted: G        W           7.0.0-rc7-hmtc-g3cb8f09d448a #1 PREEMPT(full) 
[ 1152.353871] Tainted: [W]=WARN
[ 1152.357997] Hardware name: Giga Computing E163-Z34-AAH1-000/MZ33-DC1-000, BIOS R30_F44 12/24/2025
[ 1152.368759] RIP: 0010:ref_tracker_dir_exit+0x466/0x7e0
[ 1152.375326] Code: e8 03 42 80 3c 38 00 0f 85 57 02 00 00 48 8b 03 49 89 de 49 39 dc 0f 85 2e ff ff ff 48 8b 74 24 10 48 89 ef e8 cb aa 20 02 90 <0f> 0b 90 48 8d 5d 44 be 04 00 00 00 48 89 df e8 a6 ee ec fe 48 89
[ 1152.397212] RSP: 0018:ffa00000380470d0 EFLAGS: 00010286
[ 1152.403876] RAX: 0000000000000000 RBX: ff110001a3c12668 RCX: 0000000000000001
[ 1152.412685] RDX: 00000000ffffffff RSI: 1ffffffff1841334 RDI: 0000000000000001
[ 1152.421496] RBP: ff110001a3c12618 R08: 0000000000000000 R09: 0000000000000000
[ 1152.430303] R10: 000000000000000b R11: 0000000000000001 R12: ff110001a3c12668
[ 1152.439103] R13: dead000000000100 R14: ff110001a3c12668 R15: dffffc0000000000
[ 1152.447911] FS:  00007f05f6331840(0000) GS:ff11001883299000(0000) knlGS:0000000000000000
[ 1152.457811] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1152.465057] CR2: 00007f68f567f440 CR3: 00000001a9d12003 CR4: 0000000000771ef0
[ 1152.473863] PKRU: 55555554
[ 1152.477694] Call Trace:
[ 1152.481231]  <TASK>
[ 1152.484384]  ? rcu_is_watching+0x15/0xd0
[ 1152.489585]  ? ref_tracker_free+0x870/0x870
[ 1152.495077]  ? lockdep_hardirqs_on+0x8c/0x130
[ 1152.500764]  ? __stack_depot_get_stack_record+0x10/0x10
[ 1152.507427]  ? kfree+0x151/0x610
[ 1152.511847]  free_netdev+0x402/0x870
[ 1152.516648]  ? do_raw_spin_unlock+0x59/0x250
[ 1152.522241]  ? _raw_spin_unlock_irqrestore+0x40/0x80
[ 1152.528613]  netdev_run_todo+0x83b/0xbf0
[ 1152.533804]  ? unregister_netdevice_queued+0x80/0x80
[ 1152.540172]  ? mutex_is_locked+0x1c/0x50
[ 1152.545373]  ? generic_xdp_install+0x400/0x400
[ 1152.551158]  ? mutex_is_locked+0x1c/0x50
[ 1152.556349]  ? rtnl_is_locked+0x15/0x20
[ 1152.561450]  ? unregister_netdevice_queued+0x16/0x80
[ 1152.567826]  rtnl_dellink+0x4a5/0xae0
[ 1152.572737]  ? rtnl_mdb_del+0x580/0x580
[ 1152.577839]  ? __lock_acquire+0x508/0xc10
[ 1152.583148]  ? avc_has_perm_noaudit+0xf6/0x300
[ 1152.588935]  ? sched_init_numa+0x7f7/0xc30
[ 1152.594331]  ? mark_usage+0x61/0x170
[ 1152.599142]  ? __lock_acquire+0x508/0xc10
[ 1152.604443]  ? lock_acquire.part.0+0xbc/0x260
[ 1152.610130]  ? find_held_lock+0x2b/0x80
[ 1152.615237]  ? rtnl_mdb_del+0x580/0x580
[ 1152.620339]  ? __lock_release.isra.0+0x6b/0x1a0
[ 1152.626215]  ? rtnl_mdb_del+0x580/0x580
[ 1152.631319]  rtnetlink_rcv_msg+0x6fd/0xbd0
[ 1152.636718]  ? rtnl_fdb_dump+0x690/0x690
[ 1152.641917]  ? __lock_acquire+0x508/0xc10
[ 1152.647218]  ? lock_acquire.part.0+0xbc/0x260
[ 1152.652905]  ? find_held_lock+0x2b/0x80
[ 1152.658008]  netlink_rcv_skb+0x14e/0x3a0
[ 1152.663210]  ? rtnl_fdb_dump+0x690/0x690
[ 1152.668412]  ? netlink_ack+0xcd0/0xcd0
[ 1152.673422]  ? netlink_deliver_tap+0xc5/0x330
[ 1152.679110]  ? netlink_deliver_tap+0x13c/0x330
[ 1152.684894]  netlink_unicast+0x47c/0x740
[ 1152.690098]  ? netlink_attachskb+0x800/0x800
[ 1152.695687]  ? sock_has_perm+0x283/0x3f0
[ 1152.700879]  netlink_sendmsg+0x75b/0xc90
[ 1152.706081]  ? netlink_unicast+0x740/0x740
[ 1152.711476]  ? __lock_release.isra.0+0x6b/0x1a0
[ 1152.717357]  ? __import_iovec+0x36c/0x620
[ 1152.722655]  ? __might_fault+0x97/0x140
[ 1152.727749]  __sock_sendmsg+0xca/0x180
[ 1152.732747]  ? move_addr_to_kernel+0x36/0xf0
[ 1152.738328]  ____sys_sendmsg+0x609/0x830
[ 1152.743529]  ? copy_msghdr_from_user+0x2a0/0x460
[ 1152.749510]  ? kernel_sendmsg+0x30/0x30
[ 1152.754611]  ? move_addr_to_kernel+0xf0/0xf0
[ 1152.760204]  ? kasan_save_stack+0x3d/0x50
[ 1152.765503]  ? kasan_save_stack+0x2f/0x50
[ 1152.770799]  ? kasan_record_aux_stack+0x9b/0xc0
[ 1152.776681]  ? __call_rcu_common.constprop.0+0xb2/0xa10
[ 1152.783337]  ? kmem_cache_free+0x3d0/0x5f0
[ 1152.788733]  ? fput_close_sync+0xde/0x1b0
[ 1152.794032]  ? __x64_sys_close+0x8b/0xf0
[ 1152.799235]  ___sys_sendmsg+0x14e/0x1d0
[ 1152.804337]  ? copy_msghdr_from_user+0x460/0x460
[ 1152.810330]  ? rcu_is_watching+0x15/0xd0
[ 1152.815532]  ? trace_irq_enable.constprop.0+0x9b/0x180
[ 1152.822106]  __sys_sendmsg+0x145/0x1f0
[ 1152.827110]  ? __sys_sendmsg_sock+0x20/0x20
[ 1152.832608]  ? do_raw_spin_unlock+0x59/0x250
[ 1152.838196]  ? rcu_is_watching+0x15/0xd0
[ 1152.843398]  do_syscall_64+0xf3/0x5e0
[ 1152.848307]  ? trace_hardirqs_off+0xd/0x30
[ 1152.853703]  ? exc_page_fault+0xda/0xf0
[ 1152.858805]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1152.865267] RIP: 0033:0x7f05f656b22e
[ 1152.870077] Code: 4d 89 d8 e8 94 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 03 ff ff ff 0f 1f 00 f3 0f 1e fa
[ 1152.891961] RSP: 002b:00007fff388be520 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 1152.901262] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f05f656b22e
[ 1152.910070] RDX: 0000000000000000 RSI: 00007fff388be5d0 RDI: 0000000000000003
[ 1152.918880] RBP: 00007fff388be530 R08: 0000000000000000 R09: 0000000000000000
[ 1152.927690] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[ 1152.936499] R13: 0000000069dd8ee8 R14: 000056113dabf040 R15: 0000000000000000


Either decoding failed or I forgot where NIPA-HW puts the output :S
But I think it's clear enough..

^ permalink raw reply

* Re: [PATCH net-next] pppoe: optimize hash with word access
From: Qingfang Deng @ 2026-04-14  1:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Guillaume Nault, Kees Cook, Eric Woudstra, netdev, linux-kernel
In-Reply-To: <CANn89iKmxWiCTt7nVk-DZ4R_KsYDYbPwQ5f7Hp5F8hWmr+zc=g@mail.gmail.com>

April 13, 2026 at 4:42 PM, Eric Dumazet wrote:

> 
> On Sun, Apr 12, 2026 at 8:52 PM Qingfang Deng <qingfang.deng@linux.dev> wrote:
> 
> net-next is closed.
> 
> https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/
> 
> Also I would suggest using hash32(hash, PPPOE_HASH_BITS)

Thanks for the info, but I would like to keep the same algorithm.

^ permalink raw reply

* Re: commit 0c4f1c02d27a880b cause a deadlock issue
From: He, Guocai (CN) @ 2026-04-14  1:38 UTC (permalink / raw)
  To: Greg KH, Thorsten Leemhuis
  Cc: Berg, Johannes, Friend, Linux kernel regressions list,
	Korenblit, Miriam Rachel, stable@vger.kernel.org
In-Reply-To: <2026041330-groggy-ruse-5e27@gregkh>

OK, I will send a patch later.

________________________________________
From: Greg KH <gregkh@linuxfoundation.org>
Sent: Monday, April 13, 2026 8:55 PM
To: Thorsten Leemhuis
Cc: He, Guocai (CN); Berg, Johannes; Friend; Linux kernel regressions list; Korenblit, Miriam Rachel; stable@vger.kernel.org
Subject: Re: commit 0c4f1c02d27a880b cause a deadlock issue

CAUTION: This email comes from a non Wind River email account!
Do not click links or open attachments unless you recognize the sender and know the content is safe.

On Mon, Apr 13, 2026 at 01:58:56PM +0200, Thorsten Leemhuis wrote:
> On 4/3/26 15:00, Korenblit, Miriam Rachel wrote:
> >> From: Greg KH <gregkh@linuxfoundation.org>
> >> On Fri, Apr 03, 2026 at 12:44:48PM +0000, Korenblit, Miriam Rachel wrote:
> >>>> -----Original Message-----
> >>>> From: Greg KH <gregkh@linuxfoundation.org>
> >>>> On Fri, Apr 03, 2026 at 11:08:46AM +0000, He, Guocai (CN) wrote:
> >>>>> No, The mainline have no this issue.
> >>>>> The changes of 0c4f1c02d27a880b is not in mainline.
> >>>>
> >>>> That does not make sense, that commit is really commit e1696c8bd005
> >>>> ("wifi: cfg80211: stop NAN and P2P in cfg80211_leave") which is in
> >>>> all of the following releases:
> >>>>  5.10.252 5.15.202 6.1.165 6.6.128 6.12.75 6.18.14 6.19.4 7.0-rc1
> >>>> confused,
> >>> The change is indeed in mainline, but the locking situation in
> >>> mainline is totally different (that mutex does not even exist there)
> >>> Therefore, the issue is not supposed to happen in mainline.
> >>
> >> Ok, does that commit now need to be reverted from some of the stable branches?
> >> If so, which ones?
> >
> > From every version which is < 6.7.
>
> Greg, do you still have this in your todo mail queue somewhere? Just
> wondering, as last weeks 6.6.y released afics lacked a revert of
> e1696c8bd0056b ("wifi: cfg80211: stop NAN and P2P in cfg80211_leave") --
> and I cannot spot one in your public stable queue either.
>
> These are the commits that according to Miri need to be reverted if I
> understood things right:
>
> v6.6.128 (4d7a05da767e5c), v6.1.165 (0c4f1c02d27a88), v5.15.202
> (31344ffecd7a34), v5.10.252 (d91240f24e831d)

It is, yes, my queue is huge :(

It's fastest if someone sends me the reverts and I can easily apply them
that way.  Otherwise it takes me a bit to do each one manually :(

thanks,

greg k-h

^ permalink raw reply

* [PATCH net] tcp: make probe0 timer handle expired user timeout
From: Altan Hacigumus @ 2026-04-14  1:36 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S . Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, linux-kernel, Enke Chen, Altan Hacigumus

tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
using subtraction with an unsigned lvalue.  If elapsed probing time
already exceeds the configured TCP_USER_TIMEOUT, the subtraction
underflows and yields a large value.

Handle this expiration case similarly to tcp_clamp_rto_to_user_timeout().

Fixes: 344db93ae3ee ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes")
Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com>
---
 net/ipv4/tcp_timer.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 5a14a53a3c9e..4a43356a4e06 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -50,7 +50,8 @@ static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
 u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
-	u32 remaining, user_timeout;
+	u32 user_timeout;
+	s32 remaining;
 	s32 elapsed;
 
 	user_timeout = READ_ONCE(icsk->icsk_user_timeout);
@@ -61,6 +62,8 @@ u32 tcp_clamp_probe0_to_user_timeout(const struct sock *sk, u32 when)
 	if (unlikely(elapsed < 0))
 		elapsed = 0;
 	remaining = msecs_to_jiffies(user_timeout) - elapsed;
+	if (remaining <= 0)
+		return 1;
 	remaining = max_t(u32, remaining, TCP_TIMEOUT_MIN);
 
 	return min_t(u32, remaining, when);
-- 
2.43.0


^ permalink raw reply related

* Re: [Intel-wired-lan] [PATCH iwl-next 1/2] i40e: implement basic per-queue stats
From: Jacob Keller @ 2026-04-14  1:22 UTC (permalink / raw)
  To: Paolo Abeni, Loktionov, Aleksandr,
	intel-wired-lan@lists.osuosl.org
  Cc: Nguyen, Anthony L, Kitszel, Przemyslaw, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, netdev@vger.kernel.org
In-Reply-To: <2bae0dc2-4035-4fe2-a87e-dc5dae6c7df5@redhat.com>

On 4/8/2026 7:44 AM, Paolo Abeni wrote:
> On 4/8/26 2:07 PM, Loktionov, Aleksandr wrote:
>>> -----Original Message-----
>>> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
>>> Of Paolo Abeni
>>> Sent: Wednesday, April 8, 2026 1:44 PM
>>> To: intel-wired-lan@lists.osuosl.org
>>> Cc: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel,
>>> Przemyslaw <przemyslaw.kitszel@intel.com>; Andrew Lunn
>>> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
>>> Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>;
>>> Alexei Starovoitov <ast@kernel.org>; Daniel Borkmann
>>> <daniel@iogearbox.net>; Jesper Dangaard Brouer <hawk@kernel.org>; John
>>> Fastabend <john.fastabend@gmail.com>; Stanislav Fomichev
>>> <sdf@fomichev.me>; netdev@vger.kernel.org
>>> Subject: [Intel-wired-lan] [PATCH iwl-next 1/2] i40e: implement basic
>>> per-queue stats
>>>
>>> Only expose the counters currently available (bytes, packets); add
>>> account for base stats to deal with ring clear.
>>>
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>>  drivers/net/ethernet/intel/i40e/i40e.h      |   7 ++
>>>  drivers/net/ethernet/intel/i40e/i40e_main.c | 133
>>> ++++++++++++++++++++
>>>  2 files changed, 140 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h
>>> b/drivers/net/ethernet/intel/i40e/i40e.h
>>> index dcb50c2e1aa2..fe642c464e9c 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>>> @@ -836,16 +836,23 @@ struct i40e_vsi {
>>>  	struct i40e_eth_stats eth_stats;
>>>  	struct i40e_eth_stats eth_stats_offsets;
>>>  	u64 tx_restart;
>>
>> ...
>>
>>> +static void i40e_zero_tx_ring_stats(struct netdev_queue_stats_tx *tx)
>>> {
>>> +	tx->bytes = 0;
>>> +	tx->packets = 0;
>>> +	tx->stop = 0;
>>> +	tx->wake = 0;
>>> +	tx->hw_drops = 0;
>>> +}
>>> +
>>> +static void i40e_add_tx_ring_stats(struct i40e_ring *tx_ring,
>>> +				   struct netdev_queue_stats_tx *tx) {
>>> +	u64 bytes, packets;
>>> +	unsigned int start;
>>> +
>>> +	do {
>>> +		start = u64_stats_fetch_begin(&tx_ring->syncp);
>>> +		bytes = tx_ring->stats.bytes;
>>> +		packets = tx_ring->stats.packets;
>>> +	} while (u64_stats_fetch_retry(&tx_ring->syncp, start));
>>> +
>>> +	tx->bytes += bytes;
>>> +	tx->packets += packets;
>>> +
>>> +	tx->stop += tx_ring->tx_stats.tx_stopped;
>>> +	tx->wake += tx_ring->tx_stats.restart_queue;
>>> +	tx->hw_drops += tx_ring->tx_stats.tx_busy; }
>> Why the reads are outside the seqlock region? 
>> On 32-bit kernels, unprotected u64 reads can tear IMHO
> 

Paolo is correct that just moving these into the do/while loop is
useless, since the increments aren't protected properly.

> Currently there is no seqlock on the write side; to keep the series
> small I preferred avoid fixing the pre-existing issue. In any case I
> think moving stop, wake, hw_drops (and others) under seqlock protection
> is an orthogonal change.
> 
> /P

I ended up doing some work on ice to fix a lot of similar issues a few
months ago.. The intel drivers weren't using u64_stats_t, and several
error/debug counters were not being handled appropriately.

I'd personally prefer fixing existing issues before we compound them by
adding even more incorrect code. Even on 64bit systems we need to use
READ_ONCE/WRITE_ONCE or local64_t, which the u64_stats_t type uses
internally.

I can understand the desire to limit scope of work, and the issues may
feel "minor" but ultimately I'd rather not see us continue making the
problem bigger instead of fixing it.

However.. if other maintainers feel strongly that the additions are
acceptable despite being incorrect w.r.t. the stats logic, I suppose we
can continue this and have someone from Intel look into cleaning up the
mess like I did for ice.

It looks like the series has some other requested changes either way though.

^ permalink raw reply

* Re: [RFC] Proposal: Add sysfs interface for PCIe TPH Steering Tag retrieval and configuration
From: fengchengwen @ 2026-04-14  1:07 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Bjorn Helgaas, linux-rdma, linux-pci, netdev,
	dri-devel, Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang
In-Reply-To: <20260413191930.GP21470@unreal>

On 4/14/2026 3:19 AM, Leon Romanovsky wrote:
> On Mon, Apr 13, 2026 at 08:04:10PM +0800, fengchengwen wrote:
>> On 4/13/2026 6:01 PM, Leon Romanovsky wrote:
>>> On Fri, Apr 10, 2026 at 10:30:52PM +0800, fengchengwen wrote:
>>>> Hi all,
>>>>
>>>> I'm writing to propose adding a sysfs interface to expose and configure the
>>>> PCIe TPH
>>>> Steering Tag for PCIe devices, which is retrieved inside the kernel.
>>>>
>>>>
>>>> Background: The TPH Steering Tag is tightly coupled with both a PCIe device
>>>> (identified
>>>> by its BDF) and a CPU core. It can only be obtained in kernel mode. To allow
>>>> user-space
>>>> applications to fetch and set this value securely and conveniently, we need
>>>> a standard
>>>> kernel-to-user interface.
>>>>
>>>>
>>>> Proposed Solution: Add several sysfs attributes under each PCIe device's
>>>> sysfs directory:
>>>> 1. /sys/bus/pci/devices/<BDF>/tph_mode to query the TPH mode (interrupt or
>>>> device specific)
>>>> 2. /sys/bus/pci/devices/<BDF>/tph_enable to control the TPH feature
>>>> 3. /sys/bus/pci/devices/<BDF>/tph_st to support both read and write
>>>> operations, e.g.:
>>>>    Read operation:
>>>>      echo "cpu=3" > /sys/bus/pci/devices/0000:01:00.0/tph_st
>>>>      cat /sys/bus/pci/devices/0000:01:00.0/tph_st
>>>>    Write operation:
>>>>      echo "index=10 st=123" > /sys/bus/pci/devices/0000:01:00.0/tph_st
>>>>
>>>>
>>>> The design strictly follows PCI subsystem sysfs standards and has the
>>>> following key properties:
>>>>
>>>> 1. Dynamic Visibility: The sysfs attributes will only be present for PCIe
>>>> devices that
>>>>    support TPH Steering Tag. Devices without TPH capability will not show
>>>> these nodes,
>>>>    avoiding unnecessary user confusion.
>>>>
>>>> 2. Permission Control: The attributes will use 0600 file permissions,
>>>> ensuring only
>>>>    privileged root users can read or write them, which satisfies security
>>>> requirements
>>>>    for hardware configuration interfaces.
>>>>
>>>> 3. Standard Implementation Location: The interface will be implemented in
>>>>    drivers/pci/pci-sysfs.c, the canonical location for all PCI device sysfs
>>>> attributes,
>>>>    ensuring consistency and maintainability within the PCI subsystem.
>>>>
>>>>
>>>> Why sysfs instead of alternatives like VFIO-PCI ioctl:
>>>>
>>>> - Universality: sysfs does not require binding the device to a special
>>>> driver such as
>>>>   vfio-pci. It is available to any privileged user-space component,
>>>> including system
>>>>   utilities, daemons, and monitoring tools.
>>>>
>>>> - Simplicity: Both user-space usage (cat/echo) and kernel implementation are
>>>>   straightforward, reducing code complexity and long-term maintenance cost.
>>>>
>>>> - Design Alignment: TPH Steering Tag is a generic PCIe device feature, not
>>>> specific to
>>>>   user-space drivers like DPDK or VFIO. Exposing it via sysfs matches the
>>>> kernel's
>>>>   standard pattern for hardware capabilities.
>>>>
>>>>
>>>> I look forward to your comments about this design before submitting the
>>>> final patch.
>>>
>>> You need to explain more clearly why this write functionality is useful
>>> and necessary outside the VFIO/RDMA context:
>>> https://lore.kernel.org/all/20260324234615.3731237-1-zhipingz@meta.com/
>>>
>>> AFAIK, for non-VFIO TPH callers, kernel has enough knowledge to set
>>> right ST values.
>>>
>>> There are several comments regarding the implementation, but those can wait
>>> until the rationale behind the proposal is fully clarified.
>>
>> Thanks for your review and comments.
>>
>> Let me clarify the rationale behind this user-space sysfs interface:
>>
>> 1. VFIO is just one of the user-space device access frameworks.
>>    There are many other in-kernel frameworks that expose devices
>>    to user space, such as UIO, UACCE, etc., which may also require
>>    TPH Steering Tag support.
>>
>> 2. The kernel can automatically program Steering Tags only when
>>    the device provides a standard ST table in MSI-X or config space.
>>    However, many devices implement vendor-specific or platform-specific
>>    Steering Tag programming methods that cannot be fully handled
>>    by the generic kernel code.
>>
>> 3. For such devices, user-space applications or framework drivers
>>    need to retrieve and configure TPH Steering Tags directly.
>>    A unified sysfs interface allows all user-space frameworks
>>    (not just VFIO) to use a common, standard way to manage
>>    TPH Steering Tags, rather than implementing duplicated logic
>>    in each subsystem.
>>
>> This interface provides a uniform method for any user-space
>> device access solution to work with TPH, which is why I believe
>> it is useful and necessary beyond the VFIO/RDMA case.
> 
> I understand the rationale for providing a read interface, for example for
> debugging, but I do not see any justification for a write interface.

Thank you for the comment!

As I explained, read interface is not only for debugging. It was used to
such device who don't declare ST location in MSI-X or config-space, the following
is Intel X710 NIC device's lspci output (only TPH part):

	Capabilities: [1a0 v1] Transaction Processing Hints
		Device specific mode supported
		No steering table available

So we could not config the ST for device on kernel because it's vendor specific.
But we could configure ST by it's vendor user-space driver, in this case, we
should get ST from kernel to user-space.


As for write interface, which was used to devices whose ST location is known, I
think we could simple it, and only passing <index, cpu>, then kernel query cpu's ST
and set to corresponding index.

> 
> TPH is defined by the PCI specification. If a device intends to support it,
> then it should conform to the specification.

According to the PCI specification 6.17.3 ST Modes of Operation:

  Device Specific Mode: It is recommended for the Function to use a Steering Tag value from an ST Table entry, but it is not required.

  In the Device Specific Mode of operation, the assignment of the Steering Tags to Requests is device specific. The number
  of Steering Tags used by the Function is permitted to be different than the number of interrupt vectors allocated for the
  Function, irrespective of the ST Table location, and Steering Tag values used in Requests are not required to come from
  the architected ST Table.

Thanks

> 
> Thanks
> 
> 
>>
>> Thanks
>>
>>>
>>> Thanks
>>>
>>>>
>>>> Best regards,
>>>> Chengwen Feng
>>>>
>>
>>


^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH iwl-net v2 2/6] ixgbe: add bounds check for debugfs register access
From: Jacob Keller @ 2026-04-14  1:00 UTC (permalink / raw)
  To: Simon Horman, Aleksandr Loktionov
  Cc: intel-wired-lan, anthony.l.nguyen, netdev, Paul Greenwalt
In-Reply-To: <20260413103050.GL469338@kernel.org>

On 4/13/2026 3:30 AM, Simon Horman wrote:
> On Wed, Apr 08, 2026 at 03:11:50PM +0200, Aleksandr Loktionov wrote:
>> From: Paul Greenwalt <paul.greenwalt@intel.com>
>>
>> Prevent out-of-bounds MMIO accesses triggered through user-controlled
>> register offsets.  IXGBE_HFDR (0x15FE8) is the highest valid MMIO
>> register in the ixgbe register map; any offset beyond it would address
>> unmapped memory.
>>
>> Add a defense-in-depth check at two levels:
>>
>> 1. ixgbe_read_reg() -- the noinline register read accessor.  A
>>    WARN_ON_ONCE() guard here catches any future code path (including
>>    ioctl extensions) that might inadvertently pass an out-of-range
>>    offset without relying on higher layers to catch it first.
>>    ixgbe_write_reg() is a static inline called from the TX/RX hot path;
>>    adding WARN_ON_ONCE there would inline the check at every call site,
>>    so only the read path gets this guard.
>>
>> 2. ixgbe_dbg_reg_ops_write() -- the debugfs 'reg_ops' interface is the
>>    only current path where a raw, user-supplied offset enters the driver.
>>    Gating it before invoking the register accessors provides a clean,
>>    user-visible failure (silent ignore with no kernel splat) for
>>    deliberately malformed debugfs writes.
>>
>> Add a reg <= IXGBE_HFDR guard to both the read and write paths in
>> ixgbe_dbg_reg_ops_write(), and a WARN_ON_ONCE + early-return guard to
>> ixgbe_read_reg().
>>
>> Fixes: 91fbd8f081e2 ("ixgbe: added reg_ops file to debugfs")
>> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>> ---
>> v1 -> v2:
>>  - Add Fixes: tag; reroute from iwl-next to iwl-net (security-relevant
>>    hardening for user-controllable out-of-bounds MMIO).
> 
> Thanks for the update.
> 
> And sorry for not thinking to ask this earlier: this patch
> addresses possible overruns of the mapped address space if the
> supplied value for reg is too large. But do we also need a
> guard against underrun if the value for reg is too small?
> 

I don't think so. This is bounds checking a register offset which is an
unsigned 32-bit value and begins at 0, so the map goes from 0 to
IXGBE_HFDR. Since the value is unsigned, if it does underflow somehow it
would then get caught by the check for IXGBE_HFDR right?

Thanks,
Jake

> ...


^ permalink raw reply

* Re: [PATCH net-next v2 14/14] net: macb: use context swapping in .ndo_change_mtu()
From: Jakub Kicinski @ 2026-04-14  0:56 UTC (permalink / raw)
  To: theo.lebrun
  Cc: Jakub Kicinski, nicolas.ferre, claudiu.beznea, andrew+netdev,
	davem, edumazet, pabeni, richardcochran, linux, pvalerio, conor,
	nb, vladimir.kondratiev, gregory.clement, benoit.monin,
	tawfik.bayouk, thomas.petazzoni, maxime.chevallier, netdev,
	linux-kernel
In-Reply-To: <20260410-macb-context-v2-14-af39f71d40b6@bootlin.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: macb: use context swapping in .ndo_change_mtu()

This commit introduces a mechanism to perform context swapping when changing
the MTU while the interface is running. It allocates a new context,
reconfigures the hardware via a start and end swap process, and then frees
the old context to avoid PHY reinitialization and gracefully handle memory
pressure.

> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index d1536cb1e40bb..79371718f8392 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3430,11 +3430,25 @@ static int macb_close(struct net_device *netdev)
>  
>  static int macb_change_mtu(struct net_device *netdev, int new_mtu)
>  {
> -	if (netif_running(netdev))
> -		return -EBUSY;
> +	struct macb *bp = netdev_priv(netdev);
> +	bool running = netif_running(netdev);
> +	struct macb_context *new_ctx;
> +
> +	if (running) {
> +		new_ctx = macb_context_alloc(bp, new_mtu,
> +					     bp->configured_rx_ring_size,
> +					     bp->configured_tx_ring_size);
> +		if (IS_ERR(new_ctx))
> +			return PTR_ERR(new_ctx);
> +
> +		macb_context_swap_start(bp);

Does macb_context_swap_start() gracefully halt the transmitter before the
context is freed?

It appears to attempt disabling the hardware transmitter by clearing the
TE and RE bits in the NCR register, but does not poll the hardware to ensure
it has reached an idle state.

Immediately after, macb_context_swap_end() frees the descriptor rings and data
buffers via macb_free_consistent().

If the MAC DMA engine is still mid-transfer, could it read or write to kernel
memory that has already been freed or reallocated?

Can this code cause an AB-BA deadlock between napi_disable() and
cancel_work_sync()?

In macb_context_swap_start(), the code calls napi_disable() for all queues,
and then immediately calls cancel_work_sync() on the error tasks.

Both of these work items internally invoke napi_disable().

If either error task runs concurrently, it will block forever inside its
napi_disable() waiting for the NAPI state bits (which were already set by the
first call in macb_context_swap_start()) to clear.

Consequently, could cancel_work_sync() hang indefinitely waiting for the
blocked work task to finish?

>  	}
>  
>  	WRITE_ONCE(netdev->mtu, new_mtu);
>  
> +	if (running)
> +		macb_context_swap_end(bp, new_ctx);

Does macb_context_swap_end() leak pending TX buffers and their DMA mappings?

When changing the MTU while the interface is running, macb_context_swap_end()
calls macb_free_consistent() to free the old DMA context. While RX buffers are
properly freed, for the TX queues it appears to only free the tracking array
pointer without iterating through the ring to unmap the DMA mappings and free
the actual buffers.

Any packets queued in the hardware ring at the moment of the MTU swap might
be permanently leaked, which could lead to memory and IOMMU space exhaustion
under active traffic.

Will MTU changes on EMAC devices permanently disable hardware interrupts?

macb_context_swap_start() disables all interrupts unconditionally by writing
-1 to the IDR register.

However, macb_context_swap_end() restores interrupts via the IER register
inside an explicit check:

    if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
        ...
    }

Since there is no else branch to re-enable interrupts for EMAC devices,
could changing the MTU on these devices leave interrupts permanently masked
and cause the network interface to silently stall?

Is there a missing spinlock when modifying the IER and NCR registers in
macb_context_swap_end()?

Modifications to the NCR and interrupt registers in macb_context_swap_start()
are protected by the spinlock. However, macb_context_swap_end() performs
lockless read-modify-write operations on both IER and NCR:

    ctrl = macb_readl(bp, NCR);
    macb_writel(bp, NCR, ctrl | MACB_BIT(RE) | MACB_BIT(TE));

If concurrent hardware events or background tasks take the lock and modify
NCR during this window, could their updates be silently overwritten?

> +
>  	return 0;
>  }

^ permalink raw reply

* Re: [PATCH net-next v2 13/14] net: macb: use context swapping in .set_ringparam()
From: Jakub Kicinski @ 2026-04-14  0:56 UTC (permalink / raw)
  To: theo.lebrun
  Cc: Jakub Kicinski, nicolas.ferre, claudiu.beznea, andrew+netdev,
	davem, edumazet, pabeni, richardcochran, linux, pvalerio, conor,
	nb, vladimir.kondratiev, gregory.clement, benoit.monin,
	tawfik.bayouk, thomas.petazzoni, maxime.chevallier, netdev,
	linux-kernel
In-Reply-To: <20260410-macb-context-v2-13-af39f71d40b6@bootlin.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: macb: use context swapping in .set_ringparam()

This patch modifies the ethtool set_ringparam implementation to use a
context swapping mechanism. Instead of closing and reopening the interface,
it allocates new buffers, stops the interface temporarily, swaps the active
context, and then frees the old one.

> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 3deab87c565b5..d1536cb1e40bb 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3081,6 +3081,89 @@ static void macb_configure_dma(struct macb *bp)
[ ... ]
> +static void macb_context_swap_start(struct macb *bp)
> +{
[ ... ]
> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
> +		napi_disable(&queue->napi_rx);
> +		napi_disable(&queue->napi_tx);

This isn't a regression from this patch, but does calling napi_disable()
unconditionally crash on EMAC devices?

EMAC devices appear to not initialize NAPI, which leaves the NAPI struct
zero-initialized (n->dev is NULL), potentially causing a NULL pointer
dereference in napi_disable(). Additionally, does this path also blindly
call bp->macbgem_ops.mog_init_rings() which is NULL on EMAC?

> +		netdev_tx_reset_queue(netdev_get_tx_queue(bp->netdev, q));
> +		cancel_work_sync(&queue->tx_error_task);
> +	}
[ ... ]
> +static void macb_context_swap_end(struct macb *bp,
> +				  struct macb_context *new_ctx)
> +{
[ ... ]
> +	/* Swap contexts & give buffer pointers to HW. */
> +
> +	old_ctx = bp->ctx;
> +	bp->ctx = new_ctx;
> +	macb_init_buffers(bp);

Since the new context swap doesn't call phylink_stop(), could a concurrent
macb_mac_link_up() execution set the RE and TE bits in NCR during the window
where bp->lock is dropped?

If so, macb_init_buffers() would write new DMA ring base addresses to RBQP
and TBQP while the MAC is actively enabled, which violates hardware
specifications and could cause DMA corruption.

[ ... ]
> +	if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
> +		for (q = 0, queue = bp->queues; q < bp->num_queues;
> +		     ++q, ++queue) {
> +			queue_writel(queue, IER,
> +				     bp->rx_intr_mask |
> +				     MACB_TX_INT_FLAGS |
> +				     MACB_BIT(HRESP));
> +		}
> +	}

Hardware interrupts are disabled unconditionally for all devices in
macb_context_swap_start() by writing -1 to IDR.

Will this explicit check prevent EMAC devices from having their interrupts
re-enabled?

Since EMAC devices do not use NAPI and rely on hardware interrupts, leaving
them masked might silently halt all network traffic after a ring parameter
change.

> +
> +	ctrl = macb_readl(bp, NCR);
> +	macb_writel(bp, NCR, ctrl | MACB_BIT(RE) | MACB_BIT(TE));

Can this unprotected read-modify-write of the NCR register race with phylink
state changes?

If an asynchronous phylink callback like macb_mac_config() modifies NCR
while holding bp->lock, this unlocked write could overwrite and drop those
concurrent updates.

> +
> +	netif_tx_start_all_queues(bp->netdev);
> +
> +	/* Free old context. */
> +
> +	macb_free_consistent(old_ctx);

This isn't a regression from this patch, but does this context swap leak
pending TX packets and their DMA mappings?

When freeing the old context, macb_free_consistent() appears to only free
the pointer array. If it doesn't iterate from txq->tail to txq->head to
unmap the active DMA mappings and free the sk_buff structs, repeatedly
modifying ring parameters could cause a memory and DMA mapping leak.

^ permalink raw reply

* Re: [PATCH net-next v2 12/14] net: macb: re-read ISR inside IRQ handler locked section
From: Jakub Kicinski @ 2026-04-14  0:52 UTC (permalink / raw)
  To: theo.lebrun
  Cc: Jakub Kicinski, nicolas.ferre, claudiu.beznea, andrew+netdev,
	davem, edumazet, pabeni, richardcochran, linux, pvalerio, conor,
	nb, vladimir.kondratiev, gregory.clement, benoit.monin,
	tawfik.bayouk, thomas.petazzoni, maxime.chevallier, netdev,
	linux-kernel
In-Reply-To: <20260410-macb-context-v2-12-af39f71d40b6@bootlin.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: macb: re-read ISR inside IRQ handler locked section

This commit adds a second read of the ISR register inside the lock in
macb_interrupt() to handle a race condition where the status might have
been updated by concurrent hardware resets.

> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 2633234012bae..3deab87c565b5 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -2190,6 +2190,13 @@ static irqreturn_t macb_interrupt(int irq, void *dev_id)
>  
>  	spin_lock(&bp->lock);
>  
> +	/* `status` stack variable might be stalled => re-read it */
> +	status = queue_readl(queue, ISR);

Will this second read discard interrupt events on older MACB variants that
use clear-on-read for the ISR?

If the hardware does not have the MACB_CAPS_ISR_CLEAR_ON_WRITE capability,
the first queue_readl call before the lock already cleared the active bits.

Overwriting the local status variable here sets it to 0, which permanently
drops the initially captured Rx/Tx completions and stalls the network
interface.

> +	if (unlikely(!status)) {
> +		spin_unlock(&bp->lock);
> +		return IRQ_NONE;

If the function reaches this point, the initial status read before the lock
was non-zero, meaning the device legitimately asserted the interrupt.

If a concurrent operation cleared the hardware state while this thread
waited for the lock, will returning IRQ_NONE falsely flag this as a
spurious interrupt?

Since the interrupt was actually raised by the device, doing this repeatedly
can cause the core IRQ subsystem to permanently disable the IRQ line.

Could this return IRQ_HANDLED instead?

> +	}
> +
>  	while (status) {
>  		/* close possible race with dev_close */
>  		if (unlikely(!netif_running(netdev))) {
-- 
pw-bot: cr

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH iwl-net v2 1/6] ixgbe: fix SWFW semaphore timeout for X550 family
From: Jacob Keller @ 2026-04-14  0:56 UTC (permalink / raw)
  To: Aleksandr Loktionov, intel-wired-lan, anthony.l.nguyen; +Cc: netdev
In-Reply-To: <20260408131154.2661818-2-aleksandr.loktionov@intel.com>

On 4/8/2026 6:11 AM, Aleksandr Loktionov wrote:
> According to FW documentation, the most time-consuming FW operation is
> Shadow RAM (SR) dump which takes up to 3.2 seconds.  For X550 family
> devices the module-update FW command can take over 4.5 s.  The default
> semaphore loop runs 200 iterations with a 5 ms sleep each, giving a
> maximum wait of 1 s -- not "200 ms" as previously stated in error.
> This is insufficient for X550 family FW update operations and causes
> spurious EBUSY failures.
> 
> Extend the SW/FW semaphore timeout from 1 s to 5 s (1000 iterations x
> 5 ms) for all three X550 variants: ixgbe_mac_X550, ixgbe_mac_X550EM_x,
> and ixgbe_mac_x550em_a.  All three share the same FW and exhibit the
> same worst-case latency.  Use three explicit mac.type comparisons rather
> than a range check so future MAC additions are not inadvertently
> captured.
> 
> The timeout variable is set immediately before the loop so the intent
> is clear, with an inline comment stating the resulting maximum delay.
> 
> Suggested-by: Soumen Karmakar <soumen.karmakar@intel.com>
> Cc: stable@vger.kernel.org
> Suggested-by: Marta Plantykow <marta.a.plantykow@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v1 -> v2:
>  - Squash with 0015 (X550EM extension); fix commit message ("200ms" was
>    wrong, actual default is 1 s); replace >= / <= range check with three
>    explicit mac.type == comparisons per Tony Nguyen.
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
> index e67e2fe..a3c8f51 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
> @@ -577,6 +577,15 @@ int ixgbe_acquire_swfw_sync_X540(struct ixgbe_hw *hw, u32 mask)
>  
>  	swmask |= swi2c_mask;
>  	fwmask |= swi2c_mask << 2;
> +	/* Extend to 5 s (1000 x 5 ms) for X550 family; default is 1 s
> +	 * (200 x 5 ms).  FW SR-dump takes up to 3.2 s; module-update up
> +	 * to 4.5 s.
> +	 */
> +	if (hw->mac.type == ixgbe_mac_X550 ||
> +	    hw->mac.type == ixgbe_mac_X550EM_x ||
> +	    hw->mac.type == ixgbe_mac_x550em_a)
> +		timeout = 1000;
> +

Typically, I would request and prefer if we would refactor timeout loops
like this to use read_poll_timeout() instead of open coding the loop
like we do here. The current loop is somewhat complicated so I can
understand it might be tricky to refactor.

The issue with open coded loops like this is that usleep_range has
variable length waiting time, so we sleep for anywhere between 5 and 6
milliseconds in this case. This makes the total amount of time waiting
cap at 6 seconds and not the expected 5, with the actual amount of time
waiting being variable based on when the usleep_range wakes up.

Perhaps this specific loop is a bit more complicated and not worth the
effort to refactor to read_poll_timeout, but its something I've been
trying to get us to cleanup (both inside Intel drivers and in other
places in the kernel) when modifying such open-coded timeout loops.

Since this loop body is a bit more complicated (it has to take and
release the semaphore to check the condition) I can accept it doesn't
make sense to modify it here for net.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

Thanks,
Jake

>  	for (i = 0; i < timeout; i++) {
>  		/* SW NVM semaphore bit is used for access to all
>  		 * SW_FW_SYNC bits (not just NVM)


^ permalink raw reply

* Re: [PATCH net-next v2 13/14] net: macb: use context swapping in .set_ringparam()
From: Jakub Kicinski @ 2026-04-14  0:50 UTC (permalink / raw)
  To: Théo Lebrun
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Richard Cochran, Russell King,
	Paolo Valerio, Conor Dooley, Nicolai Buchwitz,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, netdev,
	linux-kernel
In-Reply-To: <20260410-macb-context-v2-13-af39f71d40b6@bootlin.com>

On Fri, 10 Apr 2026 21:52:01 +0200 Théo Lebrun wrote:
> ethtool_ops.set_ringparam() is implemented using the primitive close /
> update ring size / reopen sequence. Under memory pressure this does not
> fly: we free our buffers at close and cannot reallocate new ones at
> open. Also, it triggers a slow PHY reinit.
> 
> Instead, exploit the new context mechanism and improve our sequence to:
>  - allocate a new context (including buffers) first
>  - if it fails, early return without any impact to the interface
>  - stop interface
>  - update global state (bp, netdev, etc)
>  - pass buffer pointers to the hardware
>  - start interface
>  - free old context.
> 
> The HW disable sequence is inspired by macb_reset_hw() but avoids
> (1) setting NCR bit CLRSTAT and (2) clearing register PBUFRXCUT.
> 
> The HW re-enable sequence is inspired by macb_mac_link_up(), skipping
> over register writes which would be redundant (because values have not
> changed).
> 
> The generic context swapping parts are isolated into helper functions
> macb_context_swap_start|end(), reusable by other operations (change_mtu,
> set_channels, etc).

> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 81beb67b206a..340ae7d881c6 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3081,6 +3081,89 @@ static void macb_configure_dma(struct macb *bp)
>  	}
>  }
>  
> +static void macb_context_swap_start(struct macb *bp)
> +{
> +	struct macb_queue *queue;
> +	unsigned long flags;
> +	unsigned int q;
> +	u32 ctrl;
> +
> +	/* Disable software Tx, disable HW Tx/Rx and disable NAPI. */
> +
> +	netif_tx_disable(bp->netdev);

AFAIR netif_tx_disable() just stops all the queues, if the NAPIs and
whatever else may wake queues is still running the queues may get
restarted right away.

> +	spin_lock_irqsave(&bp->lock, flags);
> +
> +	ctrl = macb_readl(bp, NCR);
> +	macb_writel(bp, NCR, ctrl & ~(MACB_BIT(RE) | MACB_BIT(TE)));
> +
> +	macb_writel(bp, TSR, -1);
> +	macb_writel(bp, RSR, -1);
> +
> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
> +		queue_writel(queue, IDR, -1);
> +		queue_readl(queue, ISR);
> +		if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
> +			queue_writel(queue, ISR, -1);
> +	}
> +

^ permalink raw reply

* Re: [PATCH net-next v2 09/14] net: macb: change caps helpers signatures
From: Jakub Kicinski @ 2026-04-14  0:47 UTC (permalink / raw)
  To: Théo Lebrun
  Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Richard Cochran, Russell King,
	Paolo Valerio, Conor Dooley, Nicolai Buchwitz,
	Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
	Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, netdev,
	linux-kernel
In-Reply-To: <20260410-macb-context-v2-9-af39f71d40b6@bootlin.com>

On Fri, 10 Apr 2026 21:51:57 +0200 Théo Lebrun wrote:
> For parallel MACB context to start become a reality, many functions will
> soon not have access to `struct macb *bp`. Those will still have access
> to caps through ctx->info->caps.
> 
> Change all caps helpers signatures, from taking `struct macb *bp` to
> taking `u32 caps`.

Subjectively I feel like this is a slight loss of type safety.
Someone may pass the wrong u32 and compiler will not help?

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net] idpf: fix double free and use-after-free in aux device error paths
From: Jacob Keller @ 2026-04-14  0:46 UTC (permalink / raw)
  To: Greg Kroah-Hartman, intel-wired-lan, netdev@vger.kernel.org,
	Jakub Kicinski
  Cc: netdev, linux-kernel, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	stable
In-Reply-To: <2026041116-retail-bagginess-250f@gregkh>

On 4/11/2026 3:12 AM, Greg Kroah-Hartman wrote:
> When auxiliary_device_add() fails in idpf_plug_vport_aux_dev() or
> idpf_plug_core_aux_dev(), the err_aux_dev_add label calls
> auxiliary_device_uninit() and falls through to err_aux_dev_init.  The
> uninit call will trigger put_device(), which invokes the release
> callback (idpf_vport_adev_release / idpf_core_adev_release) that frees
> iadev.  The fall-through then reads adev->id from the freed iadev for
> ida_free() and double-frees iadev with kfree().
> 
> Free the IDA slot and clear the back-pointer before uninit, while adev
> is still valid, then return immediately.
> 
> Commit 65637c3a1811 65637c3a1811 ("idpf: fix UAF in RDMA core aux dev
> deinitialization") fixed the same use-after-free in the matching unplug
> path in this file but missed both probe error paths.
> 
> Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: stable <stable@kernel.org>
> Fixes: be91128c579c ("idpf: implement RDMA vport auxiliary dev create, init, and destroy")
> Fixes: f4312e6bfa2a ("idpf: implement core RDMA auxiliary dev create, init, and destroy")
> Assisted-by: gregkh_clanker_t1000
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---

This is targeted at [net]. The fix seems straight forward enough.
@Jakub, I have no objections if you want to pull this directly. I am not
sure our validation team will find anything when testing since this is
an error path that is historically been difficult for us to test.

I'm also fine with taking it through iwl-net if you prefer, but just
want to avoid duplicate work if you're already considering it.

> Note, these cleanup paths are messy, but I couldn't see a simpler way
> without a lot more rework, so I choose the simple way :)
> 

Yea, I didn't see a better way either.

Thanks,
Jake

^ permalink raw reply

* Re: [PATCH net-next v4 10/10] enic: add V2 VF probe with admin channel and PF registration
From: Jakub Kicinski @ 2026-04-14  0:22 UTC (permalink / raw)
  To: devnull+satishkh.cisco.com
  Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
	linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
	satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-10-f052326c2a57@cisco.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: add V2 VF probe with admin channel and PF registration

When a V2 SR-IOV VF probes, open the admin channel, initialize the
MBOX protocol, perform the capability check with the PF, and register
with the PF. This establishes the PF-VF communication path that the PF
uses to send link state notifications.

> diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
> index 057716ccc2837..bf4417e67b16d 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_main.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_main.c
[ ... ]
> @@ -2992,6 +2997,32 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		goto err_out_dev_close;
>  	}
>  
> +	/* V2 VF: open admin channel and register with PF.
> +	 * Must happen before register_netdev so the VF is fully
> +	 * initialized before the interface is visible to userspace.
> +	 */
> +	if (enic_is_sriov_vf_v2(enic)) {
> +		err = enic_admin_channel_open(enic);
> +		if (err) {
> +			dev_err(dev,
> +				"Failed to open admin channel: %d\n", err);
> +			goto err_out_dev_deinit;
> +		}
> +		enic_mbox_init(enic);

Could this ordering cause a race condition with an uninitialized completion object?

enic_admin_channel_open() dynamically allocates the MSI-X interrupt, registers
the IRQ handler, and enables the admin hardware queues, making the interrupt
live immediately.

Only afterward is enic_mbox_init() called, which initializes the
enic->mbox_comp completion struct. If a spurious interrupt or a left-over
admin message causes the admin NAPI poll handler to fire between these two
calls, it will attempt to call complete(&enic->mbox_comp) before the
completion struct's internal lock has been initialized.

Also regarding the admin channel resources, are we properly handling the admin
channel across a soft reset?

When the driver recovers from a Tx hang or queue error via
enic_tx_hang_reset() or enic_reset(), it issues a CMD_SOFT_RESET to the
firmware, which resets and disables all hardware queues on the VNIC. The
recovery path then calls enic_init_vnic_resources(), which only re-initializes
and enables the datapath queues (wq, rq, cq).

It completely ignores the admin channel queues (admin_wq, admin_rq, admin_cq).
As a result, the admin queues remain disabled in hardware and desynchronized
from the software descriptors.

> +		err = enic_mbox_vf_capability_check(enic);
[ ... ]
> @@ -3156,6 +3199,17 @@ static void enic_remove(struct pci_dev *pdev)
>  		cancel_work_sync(&enic->reset);
>  		cancel_work_sync(&enic->change_mtu_work);
>  		unregister_netdev(netdev);
> +		if (enic_is_sriov_vf_v2(enic)) {
> +			if (enic->vf_registered) {
> +				int unreg_err = enic_mbox_vf_unregister(enic);
> +
> +				if (unreg_err)
> +					netdev_warn(netdev,
> +						    "Failed to unregister from PF: %d\n",
> +						    unreg_err);
> +			}
> +			enic_admin_channel_close(enic);
> +		}

Does this unregistration ordering introduce a use-after-free on the freed
netdev?

Because the admin interrupt is not masked by ndo_stop(), it remains fully
active after the network device is shut down. If the PF asynchronously sends a
link state notification after unregister_netdev(netdev) but before
enic_admin_channel_close(enic):

The admin workqueue will process the notification and call
netif_carrier_on(enic->netdev). This adds the device to the lweventlist and
calls dev_hold(dev), bumping dev->pcpu_refcnt.

However, unregister_netdev() has already completed its wait for pcpu_refcnt to
drop to zero. The driver then proceeds to call free_netdev(netdev), freeing
both the pcpu_refcnt allocation and the netdev memory. When the linkwatch
worker subsequently processes the event, it accesses this freed memory.

Should the admin channel be closed before unregister_netdev()?

> diff --git a/drivers/net/ethernet/cisco/enic/enic_res.c b/drivers/net/ethernet/cisco/enic/enic_res.c
> index 436326ace0499..74cd2ee3af5cb 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_res.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_res.c
> @@ -211,7 +211,8 @@ void enic_get_res_counts(struct enic *enic)
>  		vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_RQ) >= 1 &&
>  		vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_CQ) >=
>  			ARRAY_SIZE(enic->admin_cq) &&
> -		vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1;
> +		(enic_is_sriov_vf_v2(enic) ||
> +		 vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1);

Does this configuration lead to interrupt resource exhaustion and probe failures?

enic_adjust_resources() allocates all available interrupts (minus
ENIC_MSIX_RESERVED_INTR) to data queues, which sets enic->intr_count equal to
enic->intr_avail when queue counts are maximized.

However, enic_admin_setup_intr() requires an additional interrupt at index
intr_index = enic->intr_count. This triggers the
intr_index >= enic->intr_avail bounds check and returns -ENODEV, causing the
VF probe to abort.

Additionally, on the PF side, vnic_intr_alloc() hardcodes the resource type to
RES_TYPE_INTR_CTRL, effectively stealing a standard datapath interrupt instead
of utilizing the dedicated RES_TYPE_SRIOV_INTR hardware resource provisioned
by the firmware.

^ permalink raw reply

* Re: [PATCH net-next v4 09/10] enic: wire V2 SR-IOV enable with admin channel and MBOX
From: Jakub Kicinski @ 2026-04-14  0:22 UTC (permalink / raw)
  To: devnull+satishkh.cisco.com
  Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
	linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
	satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-9-f052326c2a57@cisco.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: wire V2 SR-IOV enable with admin channel and MBOX

This commit extends enic_sriov_configure() to handle V2 SR-IOV VFs,
allocating per-VF MBOX state and opening the admin channel before calling
pci_enable_sriov(). It also reverses the order in the disable path.

Note: The format of this report is altered due to recitation restrictions.
The original patch code is not quoted, and a free-form summary of the
observations is provided instead.

In enic_sriov_v2_enable(), enic->num_vfs is set and ENIC_SRIOV_ENABLED is
asserted, but is the enic->pp array reallocated to accommodate the VFs?

If enic->pp remains a 1-element array, could subsequent network device
operations bypass the vf < enic->num_vfs check and calculate an
out-of-bounds pointer for vf > 0 using the ENIC_PP_BY_INDEX macro,
corrupting the heap?

Is enic_sriov_configure() actually registered as the .sriov_configure
callback in the enic_driver structure?

Because the function is marked __maybe_unused, does this leave the sysfs
interface unable to invoke it, making the V2 VF dynamic enablement
unreachable?

When enic_sriov_configure() is called via sysfs, it mutates enic->pp and
enic->num_vfs. Does this happen without holding the rtnl_lock?

If concurrent network device operations running under rtnl_lock validate
the VF index and resolve enic->pp right before the sysfs path calls
kfree(enic->pp), can this result in a use-after-free when the stale pointer
is dereferenced?

In the SR-IOV disable path inside enic_sriov_configure(), the code frees
the existing port profiles array and reallocates a single element with
kzalloc_obj().

If this allocation fails and returns -ENOMEM, enic->pp is left as NULL.
Could a subsequent NDO call targeting the PF pass the enic_is_valid_pp_vf()
check and crash when dereferencing the NULL enic->pp?

When V2 SR-IOV is enabled, enic_sriov_v2_enable() calls enic_mbox_init(),
which unconditionally executes mutex_init(&enic->mbox_lock) and
init_completion(&enic->mbox_comp).

If SR-IOV is disabled and later re-enabled, will re-initializing these
active synchronization primitives over their existing state violate locking
rules?

The commit message states:
"The existing V1/USNIC SR-IOV paths are unchanged."

However, in enic_remove(), the calls to enic_dev_deinit(enic) and
vnic_dev_close(enic->vdev) were moved to execute after pci_disable_sriov().
Additionally, the enic->priv_flags &= ~ENIC_SRIOV_ENABLED cleanup was
dropped.

Could the commit message be updated to reflect this behavioral change to the
hardware teardown sequence, or was this change unintended?

^ permalink raw reply

* Re: [PATCH net-next v4 08/10] enic: add MBOX VF handlers for capability, register and link state
From: Jakub Kicinski @ 2026-04-14  0:22 UTC (permalink / raw)
  To: devnull+satishkh.cisco.com
  Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
	linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
	satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-8-f052326c2a57@cisco.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: add MBOX VF handlers for capability, register and link state

This commit implements VF-side mailbox message processing for SR-IOV V2
admin channel communication, including handlers for capability checks,
registration, and link state notifications.

> diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
> index f5784624ebbdf..b5ed31450ee76 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c

[ ... ]

> @@ -280,6 +291,136 @@ static void enic_mbox_pf_process_msg(struct enic *enic,
>  			    hdr->msg_type, vf_id, err);
>  }
>  
> +static void enic_mbox_vf_handle_capability_reply(struct enic *enic,
> +						 void *payload)
> +{
> +	struct enic_mbox_vf_capability_reply_msg *reply = payload;
> +
> +	if (le16_to_cpu(reply->reply.ret_major) == 0)
> +		enic->pf_cap_version = le32_to_cpu(reply->version);
> +	complete(&enic->mbox_comp);
> +}

Does this code handle delayed replies safely? 

If a capability request times out and the VF proceeds to send a register
request, could a delayed capability reply call complete() here and prematurely
wake up the register request? 

Since there is only a single shared completion and no validation of the
currently expected reply type, the new request might wake up, evaluate unmatched
state, and fail incorrectly.

[ ... ]

> +static void enic_mbox_vf_handle_unregister_reply(struct enic *enic,
> +						 void *payload)
> +{
> +	struct enic_mbox_vf_register_reply_msg *reply = payload;

This isn't a bug, but is it intentional to use the register_reply_msg structure
here for an unregister reply? While they currently have the same layout, this
could be brittle if the structures diverge in the future.

> +
> +	if (le16_to_cpu(reply->reply.ret_major)) {
> +		netdev_warn(enic->netdev,
> +			    "MBOX: VF unregister rejected by PF: %u/%u\n",
> +			    le16_to_cpu(reply->reply.ret_major),
> +			    le16_to_cpu(reply->reply.ret_minor));
> +	} else {
> +		enic->vf_registered = false;
> +	}
> +	complete(&enic->mbox_comp);
> +}

[ ... ]

> +	case ENIC_MBOX_VF_UNREGISTER_REPLY: {
> +		size_t exp = sizeof(struct enic_mbox_vf_register_reply_msg);

Similarly here, the register_reply_msg size is being used for the unregister
reply validation check.

> +
> +		if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
> +					     payload_len, exp))
> +			return;
> +		enic_mbox_vf_handle_unregister_reply(enic, payload);
> +		break;
> +	}

[ ... ]

> +int enic_mbox_vf_unregister(struct enic *enic)
> +{
> +	int err;
> +
> +	if (!enic->vf_registered)
> +		return 0;
> +
> +	reinit_completion(&enic->mbox_comp);
> +
> +	err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_UNREGISTER_REQUEST,
> +				 ENIC_MBOX_DST_PF, NULL, 0);
> +	if (err)
> +		return err;
> +
> +	err = enic_mbox_wait_reply(enic, 3000);
> +
> +	return enic->vf_registered ? -ETIMEDOUT : 0;
>  }

Is the err value from enic_mbox_wait_reply() intentionally ignored here?

If the PF successfully receives the request but explicitly rejects it,
enic_mbox_wait_reply() returns 0, but enic->vf_registered remains true. This
causes a protocol-level rejection to be conflated with a channel timeout by
returning -ETIMEDOUT.

^ permalink raw reply

* Re: [PATCH net-next v4 07/10] enic: add MBOX PF handlers for VF register and capability
From: Jakub Kicinski @ 2026-04-14  0:21 UTC (permalink / raw)
  To: devnull+satishkh.cisco.com
  Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
	linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
	satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-7-f052326c2a57@cisco.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: add MBOX PF handlers for VF register and capability

Implement PF-side mailbox message processing for SR-IOV V2 admin channel
communication. It processes capability, register, and unregister requests
from VFs.

> diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
> index d144c86d9ef84..f5784624ebbdf 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
[ ... ]
> @@ -124,10 +124,168 @@ int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
[ ... ]
> +static int enic_mbox_pf_handle_unregister(struct enic *enic, void *msg,
> +					  u16 vf_id, u64 msg_num)
> +{
> +	struct enic_mbox_vf_register_reply_msg reply = {};
> +	int err;
> +
> +	if (!enic->vf_state || vf_id >= enic->num_vfs) {
> +		netdev_warn(enic->netdev,
> +			    "MBOX: unregister from invalid VF %u\n", vf_id);
> +		return -EINVAL;
> +	}
> +
> +	reply.reply.ret_major = cpu_to_le16(0);
> +	err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_UNREGISTER_REPLY, vf_id,
> +				 &reply, sizeof(reply));
> +	if (err)
> +		return err;
> +
> +	enic->vf_state[vf_id].registered = false;

If enic_mbox_send_msg() fails (for example, due to a full admin WQ or channel
timeout), the function returns early and the PF misses setting the registered
state to false.

Since an unregister request indicates the VF driver is unloading or shutting
down, should the PF unconditionally clear the registration state regardless of
whether the acknowledgment successfully reaches the defunct VF?

If the local state is not cleared, could the PF assume the VF is still active
and continue attempting to send it unnecessary link state notifications?

[ ... ]
> +static void enic_mbox_pf_process_msg(struct enic *enic,
> +				     struct enic_mbox_hdr *hdr, void *payload)
> +{
> +	u16 vf_id = le16_to_cpu(hdr->src_vnic_id);
> +	u16 msg_len = le16_to_cpu(hdr->msg_len);
> +	int err = 0;
> +
> +	if (!enic->vf_state) {
> +		netdev_dbg(enic->netdev,
> +			   "MBOX: PF received msg but SRIOV not active\n");
> +		return;
> +	}
> +
> +	if (vf_id >= enic->num_vfs) {
> +		netdev_warn(enic->netdev,
> +			    "MBOX: PF received msg from invalid VF %u\n",
> +			    vf_id);
> +		return;
> +	}

Could a malicious or buggy guest VM use this untrusted VF input to cause a
host denial of service through log spamming?

Because this warning is not rate-limited, an untrusted VF could continuously
flood the PF's admin channel with malformed requests. This might exhaust host
CPU cycles and cause dmesg lock contention.

Should these logging calls in the message parsing path use
netdev_warn_ratelimited() instead?

[ ... ]
> @@ -142,10 +300,22 @@ static void enic_mbox_recv_handler(struct enic *enic, void *buf,
>  		return;
>  	}
>  
> +	msg_len = le16_to_cpu(hdr->msg_len);
> +	if (msg_len < sizeof(*hdr) || msg_len > len) {
> +		netdev_warn(enic->netdev,
> +			    "MBOX: invalid msg_len %u (buf len %u)\n",
> +			    msg_len, len);
> +		return;
> +	}

Similarly here, does this log also need rate limiting to protect the host
from being spammed by malformed headers sent from a VF?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox